1. 06 Jun, 2014 1 commit
  2. 02 Apr, 2014 1 commit
  3. 01 Apr, 2014 2 commits
  4. 09 Nov, 2013 2 commits
      locks: break delegations on rename · 8e6d782c
      J. Bruce Fields authored
      Cc: David Howells <dhowells@redhat.com>
      Acked-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      locks: break delegations on unlink · b21996e3
      J. Bruce Fields authored
      We need to break delegations on any operation that changes the set of
      links pointing to an inode.  Start with unlink.
      
      Such operations also hold the i_mutex on a parent directory.  Breaking a
      delegation may require waiting for a timeout (by default 90 seconds) in
      the case of an unresponsive NFS client.  To avoid blocking all directory
      operations, we therefore drop locks before waiting for the delegation.
      The logic then looks like:
      
      	acquire locks
      	...
      	test for delegation; if found:
      		take reference on inode
      		release locks
      		wait for delegation break
      		drop reference on inode
      		retry
      
      It is possible this could never terminate.  (Even if we take precautions
      to prevent another delegation being acquired on the same inode, we could
      get a different inode on each retry.)  But this seems very unlikely.
      
      The initial test for a delegation happens after the lock on the target
      inode is acquired, but the directory inode may have been acquired
      further up the call stack.  We therefore add a "struct inode **"
      argument to any intervening functions, which we use to pass the inode
      back up to the caller in the case it needs a delegation synchronously
      broken.
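
      The retry logic above can be sketched as a small user-space model.
      This is only an illustration under stated assumptions, not the kernel
      code: the Inode class, break_delegation_and_wait(), the one-element
      list standing in for the "struct inode **" argument, and the
      max_retries bound are all hypothetical (the commit notes that the real
      loop has no such bound).

```python
import threading


class Inode:
    """Toy inode; `delegated` stands in for an outstanding delegation."""

    def __init__(self, ino, delegated=False):
        self.ino = ino
        self.delegated = delegated
        self.refcount = 0
        self.nlink = 1


def break_delegation_and_wait(inode):
    # Stand-in for the (possibly long) wait for the holder to return it.
    inode.delegated = False


def try_unlink_locked(inode, delegated_out):
    """Innermost step, called with the directory lock held.

    `delegated_out` plays the role of the "struct inode **" argument: it
    carries the inode that needs a delegation break back up to the caller.
    """
    if inode.delegated:
        delegated_out.append(inode)
        return False
    inode.nlink -= 1
    return True


def unlink(dir_lock, lookup, max_retries=8):
    """Return the number of attempts needed to unlink the inode from lookup()."""
    for attempt in range(1, max_retries + 1):
        delegated_out = []
        with dir_lock:                      # acquire locks
            inode = lookup()
            if try_unlink_locked(inode, delegated_out):
                return attempt
            delegated_out[0].refcount += 1  # take reference on inode
        # Locks are released here, so other directory operations can proceed.
        break_delegation_and_wait(delegated_out[0])
        delegated_out[0].refcount -= 1      # drop reference on inode
    raise TimeoutError("gave up after %d retries" % max_retries)
```

      Because lookup() is called again on every attempt, a retry may see a
      different inode, which is the non-termination risk the message
      mentions.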
      
      Cc: David Howells <dhowells@redhat.com>
      Cc: Tyler Hicks <tyhicks@canonical.com>
      Cc: Dustin Kirkland <dustin.kirkland@gazzang.com>
      Acked-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  5. 20 Sep, 2013 1 commit
  6. 19 Jun, 2013 3 commits
      FS-Cache: Fix object state machine to have separate work and wait states · caaef690
      David Howells authored
      Fix object state machine to have separate work and wait states as that makes
      it easier to envision.
      
      There are now three kinds of state:
      
       (1) Work state.  This is an execution state.  No event processing is performed
           by a work state.  The function attached to a work state returns a pointer
           indicating the next state to which the OSM should transition.  Returning
           NO_TRANSIT repeats the current state, but goes back to the scheduler
           first.
      
       (2) Wait state.  This is an event processing state.  No execution is
           performed by a wait state.  Wait states are just tables of "if event X
           occurs, clear it and transition to state Y".  The dispatcher returns to
           the scheduler if none of the events in which the wait state has an
           interest are currently pending.
      
       (3) Out-of-band state.  This is a special work state.  Transitions to normal
           states can be overridden when an unexpected event occurs (eg. I/O error).
           Instead the dispatcher disables and clears the OOB event and transits to
           the specified work state.  This then acts as an ordinary work state,
           though object->state points to the overridden destination.  Returning
           NO_TRANSIT resumes the overridden transition.
      
      In addition, the states have names in their definitions, so there's no need for
      tables of state names.  Further, the EV_REQUEUE event is no longer necessary as
      that is automatic for work states.
      
      Since the states are now separate structs rather than values in an enum, it's
      not possible to use comparisons other than (non-)equality between them, so use
      some object->flags to indicate what phase an object is in.
      
      The EV_RELEASE, EV_RETIRE and EV_WITHDRAW events have been squished into one
      (EV_KILL).  An object flag now carries the information about retirement.
      
      Similarly, the RELEASING, RECYCLING and WITHDRAWING states have been merged
      into a KILL_OBJECT state and additional states have been added for handling
      waiting dependent objects (JUMPSTART_DEPS and KILL_DEPENDENTS).
      
      A state has also been added for synchronising with parent object initialisation
      (WAIT_FOR_PARENT) and another for initiating look up (PARENT_READY).
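
      The split between work states and wait states can be modelled in a few
      lines.  This is a simplified sketch, not the fscache implementation: it
      leaves out out-of-band states, and the dispatch() function, the
      CacheObject class and the four-state graph are assumptions made for
      illustration (only the KILL_OBJECT and EV_KILL names come from the
      text above).

```python
NO_TRANSIT = object()  # sentinel: stay in this state, return to the scheduler


class WorkState:
    """Execution state: runs a function that names the next state."""

    def __init__(self, name, work):
        self.name = name
        self.work = work                # callable(obj) -> state or NO_TRANSIT


class WaitState:
    """Event-processing state: a table of (event, next state) pairs."""

    def __init__(self, name, transitions):
        self.name = name
        self.transitions = transitions


class CacheObject:
    def __init__(self, state):
        self.state = state
        self.events = set()             # pending events
        self.visited = []               # names of states the dispatcher entered


def dispatch(obj, max_steps=16):
    """Run the object until it has to go back to the scheduler."""
    for _ in range(max_steps):
        state = obj.state
        obj.visited.append(state.name)
        if isinstance(state, WorkState):
            nxt = state.work(obj)
            if nxt is NO_TRANSIT:
                return "requeue"        # repeat this state after rescheduling
            obj.state = nxt
            continue
        for event, target in state.transitions:
            if event in obj.events:
                obj.events.discard(event)   # clear the event ...
                obj.state = target          # ... and transition
                break
        else:
            return "sleep"              # no interesting event is pending
    return "requeue"


# A hypothetical four-state graph, defined back to front so that each
# state can refer to its successor.
DEAD = WaitState("DEAD", [])
KILL_OBJECT = WorkState("KILL_OBJECT", lambda obj: DEAD)
WAIT_FOR_CMD = WaitState("WAIT_FOR_CMD", [("EV_KILL", KILL_OBJECT)])
INIT = WorkState("INIT", lambda obj: WAIT_FOR_CMD)
```

      Dispatching a fresh object runs INIT and then sleeps in WAIT_FOR_CMD;
      raising EV_KILL and dispatching again carries it through KILL_OBJECT
      to DEAD.
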
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-By: Milosz Tanski <milosz@adfin.com>
      Acked-by: Jeff Layton <jlayton@redhat.com>
      FS-Cache: Wrap checks on object state · 493f7bc1
      David Howells authored
      Wrap checks on object state (mostly outside of fs/fscache/object.c) with
      inline functions so that the mechanism can be replaced.
      
      Some of the state checks within object.c are left as-is as they will be
      replaced.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-By: Milosz Tanski <milosz@adfin.com>
      Acked-by: Jeff Layton <jlayton@redhat.com>
      CacheFiles: name i_mutex lock class explicitly · 6bd5e82b
      J. Bruce Fields authored
      Just some cleanup.
      
      (And note the caller of this function may, for example, call vfs_unlink
      on a child, so the "1" (I_MUTEX_PARENT) really was what was intended
      here.)
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-By: Milosz Tanski <milosz@adfin.com>
      Acked-by: Jeff Layton <jlayton@redhat.com>
  7. 20 Dec, 2012 1 commit
  8. 14 Jul, 2012 1 commit
  9. 21 Mar, 2012 1 commit
  10. 23 Jan, 2011 1 commit
  11. 22 Jul, 2010 1 commit
      fscache: convert object to use workqueue instead of slow-work · 8b8edefa
      Tejun Heo authored
      Make fscache object state transition callbacks use workqueue instead
      of slow-work.  New dedicated unbound CPU workqueue fscache_object_wq
      is created.  get/put callbacks are renamed and modified to take
      @object and called directly from the enqueue wrapper and the work
      function.  While at it, make all open coded instances of get/put to
      use fscache_get/put_object().
      
      * Unbound workqueue is used.
      
      * work_busy() output is printed instead of slow-work flags in object
        debugging outputs.  They mean basically the same thing bit-for-bit.
      
      * sysctl fscache.object_max_active added to control concurrency.  The
        default value is nr_cpus clamped between 4 and
        WQ_UNBOUND_MAX_ACTIVE.
      
      * slow_work_sleep_till_thread_needed() is replaced with fscache
        private implementation fscache_object_sleep_till_congested() which
        waits on fscache_object_wq congestion.
      
      * debugfs support is dropped for now.  Tracing API based debug
        facility is planned to be added.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: David Howells <dhowells@redhat.com>
  12. 11 May, 2010 1 commit
      CacheFiles: Fix occasional EIO on call to vfs_unlink() · c61ea31d
      David Howells authored
      Fix an occasional EIO returned by a call to vfs_unlink():
      
      	[ 4868.465413] CacheFiles: I/O Error: Unlink failed
      	[ 4868.465444] FS-Cache: Cache cachefiles stopped due to I/O error
      	[ 4947.320011] CacheFiles: File cache on md3 unregistering
      	[ 4947.320041] FS-Cache: Withdrawing cache "mycache"
      	[ 5127.348683] FS-Cache: Cache "mycache" added (type cachefiles)
      	[ 5127.348716] CacheFiles: File cache on md3 registered
      	[ 7076.871081] CacheFiles: I/O Error: Unlink failed
      	[ 7076.871130] FS-Cache: Cache cachefiles stopped due to I/O error
      	[ 7116.780891] CacheFiles: File cache on md3 unregistering
      	[ 7116.780937] FS-Cache: Withdrawing cache "mycache"
      	[ 7296.813394] FS-Cache: Cache "mycache" added (type cachefiles)
      	[ 7296.813432] CacheFiles: File cache on md3 registered
      
      What happens is this:
      
       (1) A cached NFS file is seen to have become out of date, so NFS retires the
           object and immediately acquires a new object with the same key.
      
       (2) Retirement of the old object is done asynchronously - so the lookup/create
           to generate the new object may be done first.
      
           This can be a problem as the old object and the new object must exist at
           the same point in the backing filesystem (i.e. they must have the same
           pathname).
      
       (3) The lookup for the new object sees that a backing file already exists,
           checks to see whether it is valid and sees that it isn't.  It then deletes
           that file and creates a new one on disk.
      
       (4) The retirement phase for the old file is then performed.  It tries to
           delete the dentry it has, but ext4_unlink() returns -EIO because the inode
           attached to that dentry no longer matches the inode number associated with
           the filename in the parent directory.
      
      The trace below shows this quite well.
      
      	[md5sum] ==> __fscache_relinquish_cookie(ffff88002d12fb58{NFS.fh,ffff88002ce62100},1)
      	[md5sum] ==> __fscache_acquire_cookie({NFS.server},{NFS.fh},ffff88002ce62100)
      
      NFS has retired the old cookie and asked for a new one.
      
      	[kslowd] ==> fscache_object_state_machine({OBJ52,OBJECT_ACTIVE,24})
      	[kslowd] <== fscache_object_state_machine() [->OBJECT_DYING]
      	[kslowd] ==> fscache_object_state_machine({OBJ53,OBJECT_INIT,0})
      	[kslowd] <== fscache_object_state_machine() [->OBJECT_LOOKING_UP]
      	[kslowd] ==> fscache_object_state_machine({OBJ52,OBJECT_DYING,24})
      	[kslowd] <== fscache_object_state_machine() [->OBJECT_RECYCLING]
      
      The old object (OBJ52) is going through the terminal states to get rid of it,
      whilst the new object - (OBJ53) - is coming into being.
      
      	[kslowd] ==> fscache_object_state_machine({OBJ53,OBJECT_LOOKING_UP,0})
      	[kslowd] ==> cachefiles_walk_to_object({ffff88003029d8b8},OBJ53,@68,)
      	[kslowd] lookup '@68'
      	[kslowd] next -> ffff88002ce41bd0 positive
      	[kslowd] advance
      	[kslowd] lookup 'Es0g00og0_Nd_XCYe3BOzvXrsBLMlN6aw16M1htaA'
      	[kslowd] next -> ffff8800369faac8 positive
      
      The new object has looked up the subdir in which the file would be (getting
      dentry ffff88002ce41bd0) and then looked up the file itself (getting dentry
      ffff8800369faac8).
      
      	[kslowd] validate 'Es0g00og0_Nd_XCYe3BOzvXrsBLMlN6aw16M1htaA'
      	[kslowd] ==> cachefiles_bury_object(,'@68','Es0g00og0_Nd_XCYe3BOzvXrsBLMlN6aw16M1htaA')
      	[kslowd] remove ffff8800369faac8 from ffff88002ce41bd0
      	[kslowd] unlink stale object
      	[kslowd] <== cachefiles_bury_object() = 0
      
      It then checks the file's xattrs to see if it's valid.  NFS says that the
      auxiliary data indicate the file is out of date (obvious to us - that's why NFS
      ditched the old version and got a new one).  CacheFiles then deletes the old
      file (dentry ffff8800369faac8).
      
      	[kslowd] redo lookup
      	[kslowd] lookup 'Es0g00og0_Nd_XCYe3BOzvXrsBLMlN6aw16M1htaA'
      	[kslowd] next -> ffff88002cd94288 negative
      	[kslowd] create -> ffff88002cd94288{ffff88002cdaf238{ino=148247}}
      
      CacheFiles then redoes the lookup and gets a negative result in a new dentry
      (ffff88002cd94288) which it then creates a file for.
      
      	[kslowd] ==> cachefiles_mark_object_active(,OBJ53)
      	[kslowd] <== cachefiles_mark_object_active() = 0
      	[kslowd] === OBTAINED_OBJECT ===
      	[kslowd] <== cachefiles_walk_to_object() = 0 [148247]
      	[kslowd] <== fscache_object_state_machine() [->OBJECT_AVAILABLE]
      
      The new object is then marked active and the state machine moves to the
      available state - at which point NFS can start filling the object.
      
      	[kslowd] ==> fscache_object_state_machine({OBJ52,OBJECT_RECYCLING,20})
      	[kslowd] ==> fscache_release_object()
      	[kslowd] ==> cachefiles_drop_object({OBJ52,2})
      	[kslowd] ==> cachefiles_delete_object(,OBJ52{ffff8800369faac8})
      
      The old object, meanwhile, goes on with being retired.  If allocation occurs
      first, cachefiles_delete_object() has to wait for dir->d_inode->i_mutex to
      become available before it can continue.
      
      	[kslowd] ==> cachefiles_bury_object(,'@68','Es0g00og0_Nd_XCYe3BOzvXrsBLMlN6aw16M1htaA')
      	[kslowd] remove ffff8800369faac8 from ffff88002ce41bd0
      	[kslowd] unlink stale object
      	EXT4-fs warning (device sda6): ext4_unlink: Inode number mismatch in unlink (148247!=148193)
      	CacheFiles: I/O Error: Unlink failed
      	FS-Cache: Cache cachefiles stopped due to I/O error
      
      CacheFiles then tries to delete the file for the old object, but the dentry it
      has (ffff8800369faac8) no longer points to a valid inode for that directory
      entry, and so ext4_unlink() returns -EIO when de->inode does not match i_ino.
      
      	[kslowd] <== cachefiles_bury_object() = -5
      	[kslowd] <== cachefiles_delete_object() = -5
      	[kslowd] <== fscache_object_state_machine() [->OBJECT_DEAD]
      	[kslowd] ==> fscache_object_state_machine({OBJ53,OBJECT_AVAILABLE,0})
      	[kslowd] <== fscache_object_state_machine() [->OBJECT_ACTIVE]
      
      (Note that the above trace includes extra information beyond that produced by
      the upstream code).
      
      The fix is to note when an object that is being retired has had its object
      deleted preemptively by a replacement object that is being created, and to
      skip the second removal attempt in such a case.
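
      The race and the fix can be reproduced with a toy directory model.
      This is a sketch under assumptions rather than the CacheFiles code:
      the Directory and CacheObject classes, the `buried` attribute and the
      `honour_flag` switch are hypothetical names, and the old object is
      handed to the lookup directly instead of being found through the
      cache's own bookkeeping.

```python
import errno
import itertools


class Directory:
    """Toy backing directory mapping filenames to inode numbers."""

    def __init__(self):
        self.entries = {}
        self._inos = itertools.count(1)

    def create(self, name):
        self.entries[name] = next(self._inos)
        return self.entries[name]

    def unlink(self, name, ino):
        # Like the ext4_unlink() check in the trace: refuse if the caller's
        # inode is no longer the one the directory entry points at.
        if self.entries.get(name) != ino:
            return -errno.EIO
        del self.entries[name]
        return 0


class CacheObject:
    def __init__(self, name, ino):
        self.name = name
        self.ino = ino
        self.buried = False   # set when a replacement deleted our file


def lookup_or_replace(directory, name, old, is_valid):
    """Lookup for the new object: bury a stale file, then create a new one."""
    ino = directory.entries.get(name)
    if ino is not None and not is_valid(ino):
        directory.unlink(name, ino)
        if old is not None and old.ino == ino:
            old.buried = True         # note the preemptive deletion
        ino = None
    if ino is None:
        ino = directory.create(name)
    return CacheObject(name, ino)


def retire(directory, obj, honour_flag=True):
    """Retirement of the old object; skips the unlink if already buried."""
    if honour_flag and obj.buried:
        return 0
    return directory.unlink(obj.name, obj.ino)
```

      With honour_flag=False the retirement reproduces the -EIO failure;
      with the flag honoured the second removal is skipped and the new file
      is left in place.
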
      Reported-by: Greg M <gregm@servu.net.au>
      Reported-by: Mark Moseley <moseleymark@gmail.com>
      Reported-by: Romain DEGEZ <romain.degez@smartjog.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 30 Mar, 2010 1 commit
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking... · 5a0e3ad6
      Tejun Heo authored
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
      
      percpu.h is included by sched.h and module.h and thus ends up being
      included when building most .c files.  percpu.h includes slab.h which
      in turn includes gfp.h making everything defined by the two files
      universally available and complicating inclusion dependencies.
      
      percpu.h -> slab.h dependency is about to be removed.  Prepare for
      this change by updating users of gfp and slab facilities to include
      those headers directly instead of assuming availability.  As this
      conversion needs to touch a large number of source files, the
      following script is used as the basis of conversion.
      
        http://userweb.kernel.org/~tj/misc/slabh-sweep.py
      
      The script does the following.
      
      * Scan files for gfp and slab usages and update includes such that
        only the necessary includes are there.  ie. if only gfp is used,
        gfp.h, if slab is used, slab.h.
      
      * When the script inserts a new include, it looks at the include
        blocks and tries to put the new include such that its order conforms
        to its surrounding.  It's put in the include block which contains
        core kernel includes, in the same order that the rest are ordered -
        alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
        doesn't seem to be any matching order.
      
      * If the script can't find a place to put a new include (mostly
        because the file doesn't have fitting include block), it prints out
        an error message indicating which .h file needs to be added to the
        file.
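
      The first rule (pick slab.h if slab is used, otherwise gfp.h if only
      gfp is used) might look roughly like the sketch below.  This is not
      the slabh-sweep.py script: the needed_include() function and both
      regular expressions are assumptions, and they recognise only a
      handful of well-known allocator names.

```python
import re

# Assumed usage patterns; the real script's heuristics are not shown here.
SLAB_RE = re.compile(r"\b(kmalloc|kzalloc|kcalloc|kfree|kmem_cache_\w+)\s*\(")
GFP_RE = re.compile(r"\b(GFP_[A-Z_]+|__GFP_[A-Z_]+|alloc_pages?|__get_free_pages?)\b")


def needed_include(source):
    """Return the header a C source text needs, or None if neither is used."""
    # Ignore existing #include lines so a stale include is not counted as use.
    body = "\n".join(
        line for line in source.splitlines()
        if not line.lstrip().startswith("#include")
    )
    if SLAB_RE.search(body):
        return "linux/slab.h"   # slab.h already pulls in gfp.h
    if GFP_RE.search(body):
        return "linux/gfp.h"
    return None
```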
      
      The conversion was done in the following steps.
      
      1. The initial automatic conversion of all .c files updated slightly
         over 4000 files, deleting around 700 includes and adding ~480 gfp.h
         and ~3000 slab.h inclusions.  The script emitted errors for ~400
         files.
      
      2. Each error was manually checked.  Some didn't need the inclusion,
         some needed manual addition while adding it to implementation .h or
         embedding .c file was more appropriate for others.  This step added
         inclusions to around 150 files.
      
      3. The script was run again and the output was compared to the edits
         from #2 to make sure no file was left behind.
      
      4. Several build tests were done and a couple of problems were fixed.
         e.g. lib/decompress_*.c used malloc/free() wrappers around slab
         APIs requiring slab.h to be added manually.
      
      5. The script was run on all .h files but without automatically
         editing them as sprinkling gfp.h and slab.h inclusions around .h
         files could easily lead to inclusion dependency hell.  Most gfp.h
         inclusion directives were ignored as stuff from gfp.h was usually
         widely available and often used in preprocessor macros.  Each
         slab.h inclusion directive was examined and added manually as
         necessary.
      
      6. percpu.h was updated not to include slab.h.
      
      7. Build tests were done on the following configurations and failures
         were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
         distributed build env didn't work with gcov compiles) and a few
         more options had to be turned off depending on archs to make things
         build (like ipr on powerpc/64 which failed due to missing writeq).
      
         * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
         * powerpc and powerpc64 SMP allmodconfig
         * sparc and sparc64 SMP allmodconfig
         * ia64 SMP allmodconfig
         * s390 SMP allmodconfig
         * alpha SMP allmodconfig
         * um on x86_64 SMP allmodconfig
      
      8. percpu.h modifications were reverted so that it could be applied as
         a separate patch and serve as bisection point.
      
      Given the fact that I had only a couple of failures from tests on step
      6, I'm fairly confident about the coverage of this conversion patch.
      If there is a breakage, it's likely to be something in one of the arch
      headers which should be easily discoverable on most builds of
      the specific arch.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
  14. 20 Feb, 2010 1 commit
      CacheFiles: Fix a race in cachefiles_delete_object() vs rename · 8f9941ae
      David Howells authored
      cachefiles_delete_object() can race with rename.  It gets the parent directory
      of the object it's asked to delete, then locks it - but rename may have changed
      the object's parent between the get and the completion of the lock.
      
      However, if such a circumstance is detected, we abandon our attempt to delete
      the object - since it's no longer in the index key path, it won't be seen
      again by lookups of that key.  The assumption is that cachefilesd may have
      culled it by renaming it to the graveyard for later destruction.
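
      The "get the parent, lock it, then re-check" pattern described above
      can be sketched as follows.  This is a user-space model under
      assumptions, not the cachefiles_delete_object() code: the Directory
      and BackingFile classes, rename() and the before_lock test hook (which
      stands in for the race window) are all hypothetical.

```python
import threading


class Directory:
    def __init__(self, name):
        self.name = name
        self.lock = threading.Lock()
        self.children = set()


class BackingFile:
    def __init__(self, parent):
        self.parent = parent
        parent.children.add(self)


def rename(obj, new_parent):
    """Move obj elsewhere, as a culling daemon might move it to a graveyard."""
    old_parent = obj.parent
    with old_parent.lock, new_parent.lock:
        old_parent.children.discard(obj)
        new_parent.children.add(obj)
        obj.parent = new_parent


def delete_object(obj, before_lock=None):
    """Return True if obj was deleted, False if the attempt was abandoned."""
    parent = obj.parent             # get the parent directory ...
    if before_lock is not None:
        before_lock()               # a rename may slip in here
    with parent.lock:               # ... then lock it
        if obj.parent is not parent:
            return False            # parent changed under us: abandon
        parent.children.discard(obj)
        obj.parent = None
        return True
```
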
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  15. 19 Nov, 2009 3 commits
      CacheFiles: Catch an overly long wait for an old active object · fee096de
      David Howells authored
      Catch an overly long wait for an old, dying active object when we want to
      replace it with a new one.  The probability is that all the slow-work threads
      are hogged, and the delete can't get a look in.
      
      What we do instead is:
      
       (1) if there's nothing in the slow work queue, we sleep until either the dying
           object has finished dying or there is something in the slow work queue
           behind which we can queue our object.
      
       (2) if there is something in the slow work queue, we return ETIMEDOUT to
           fscache_lookup_object(), which then puts us back on the slow work queue,
           presumably behind the deletion that we're blocked by.  We are then
           deferred for a while until we work our way back through the queue -
           without blocking a slow-work thread unnecessarily.
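
      The two cases can be illustrated with a single-threaded queue model.
      This is a sketch under assumptions, not the slow-work code:
      run_queue(), the item tuples and the trace format are hypothetical,
      and the real case (1) sleeps until woken rather than stopping.

```python
import errno
from collections import deque


def run_queue(queue, dying):
    """Process work items in order and return a trace of what happened.

    `queue` holds ("delete", key) and ("lookup", key) items; `dying` is the
    set of keys whose old object has not finished dying yet.
    """
    trace = []
    while queue:
        kind, key = queue.popleft()
        if kind == "delete":
            dying.discard(key)
            trace.append(("deleted", key))
        elif key in dying:
            if not queue:
                # Case (1): nothing to queue behind, so sleep until woken.
                trace.append(("sleep", key))
                break
            # Case (2): fail with ETIMEDOUT and go to the back of the queue.
            trace.append((-errno.ETIMEDOUT, key))
            queue.append((kind, key))
        else:
            trace.append(("looked up", key))
    return trace
```

      In case (2) the lookup ends up behind the deletion it was blocked by,
      so it succeeds on its next turn without having tied up a thread.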
      
      A backtrace similar to the following may appear in the log without this patch:
      
      	INFO: task kslowd004:5711 blocked for more than 120 seconds.
      	"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      	kslowd004     D 0000000000000000     0  5711      2 0x00000080
      	 ffff88000340bb80 0000000000000046 ffff88002550d000 0000000000000000
      	 ffff88002550d000 0000000000000007 ffff88000340bfd8 ffff88002550d2a8
      	 000000000000ddf0 00000000000118c0 00000000000118c0 ffff88002550d2a8
      	Call Trace:
      	 [<ffffffff81058e21>] ? trace_hardirqs_on+0xd/0xf
      	 [<ffffffffa011c4d8>] ? cachefiles_wait_bit+0x0/0xd [cachefiles]
      	 [<ffffffffa011c4e1>] cachefiles_wait_bit+0x9/0xd [cachefiles]
      	 [<ffffffff81353153>] __wait_on_bit+0x43/0x76
      	 [<ffffffff8111ae39>] ? ext3_xattr_get+0x1ec/0x270
      	 [<ffffffff813531ef>] out_of_line_wait_on_bit+0x69/0x74
      	 [<ffffffffa011c4d8>] ? cachefiles_wait_bit+0x0/0xd [cachefiles]
      	 [<ffffffff8104c125>] ? wake_bit_function+0x0/0x2e
      	 [<ffffffffa011bc79>] cachefiles_mark_object_active+0x203/0x23b [cachefiles]
      	 [<ffffffffa011c209>] cachefiles_walk_to_object+0x558/0x827 [cachefiles]
      	 [<ffffffffa011a429>] cachefiles_lookup_object+0xac/0x12a [cachefiles]
      	 [<ffffffffa00aa1e9>] fscache_lookup_object+0x1c7/0x214 [fscache]
      	 [<ffffffffa00aafc5>] fscache_object_state_machine+0xa5/0x52d [fscache]
      	 [<ffffffffa00ab4ac>] fscache_object_slow_work_execute+0x5f/0xa0 [fscache]
      	 [<ffffffff81082093>] slow_work_execute+0x18f/0x2d1
      	 [<ffffffff8108239a>] slow_work_thread+0x1c5/0x308
      	 [<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
      	 [<ffffffff810821d5>] ? slow_work_thread+0x0/0x308
      	 [<ffffffff8104be91>] kthread+0x7a/0x82
      	 [<ffffffff8100beda>] child_rip+0xa/0x20
      	 [<ffffffff8100b87c>] ? restore_args+0x0/0x30
      	 [<ffffffff8104be17>] ? kthread+0x0/0x82
      	 [<ffffffff8100bed0>] ? child_rip+0x0/0x20
      	1 lock held by kslowd004/5711:
      	 #0:  (&sb->s_type->i_mutex_key#7/1){+.+.+.}, at: [<ffffffffa011be64>] cachefiles_walk_to_object+0x1b3/0x827 [cachefiles]
      Signed-off-by: David Howells <dhowells@redhat.com>
      CacheFiles: Better showing of debugging information in active object problems · d0e27b78
      David Howells authored
      Show more debugging information if cachefiles_mark_object_active() is asked to
      activate an active object.
      
      This may happen, for instance, if the netfs tries to register an object with
      the same key multiple times.
      
      The code is changed to (a) get the appropriate object lock to protect the
      cookie pointer whilst we dereference it, and (b) get and display the cookie key
      if available.
      Signed-off-by: David Howells <dhowells@redhat.com>
      CacheFiles: Mark parent directory locks as I_MUTEX_PARENT to keep lockdep happy · 6511de33
      David Howells authored
      Mark parent directory locks as I_MUTEX_PARENT in the callers of
      cachefiles_bury_object() so that lockdep doesn't complain when that invokes
      vfs_unlink():
      
      =============================================
      [ INFO: possible recursive locking detected ]
      2.6.32-rc6-cachefs #47
      ---------------------------------------------
      kslowd002/3089 is trying to acquire lock:
       (&sb->s_type->i_mutex_key#7){+.+.+.}, at: [<ffffffff810bbf72>] vfs_unlink+0x8b/0x128
      
      but task is already holding lock:
       (&sb->s_type->i_mutex_key#7){+.+.+.}, at: [<ffffffffa00e4e61>] cachefiles_walk_to_object+0x1b0/0x831 [cachefiles]
      
      other info that might help us debug this:
      1 lock held by kslowd002/3089:
       #0:  (&sb->s_type->i_mutex_key#7){+.+.+.}, at: [<ffffffffa00e4e61>] cachefiles_walk_to_object+0x1b0/0x831 [cachefiles]
      
      stack backtrace:
      Pid: 3089, comm: kslowd002 Not tainted 2.6.32-rc6-cachefs #47
      Call Trace:
       [<ffffffff8105ad7b>] __lock_acquire+0x1649/0x16e3
       [<ffffffff8118170e>] ? inode_has_perm+0x5f/0x61
       [<ffffffff8105ae6c>] lock_acquire+0x57/0x6d
       [<ffffffff810bbf72>] ? vfs_unlink+0x8b/0x128
       [<ffffffff81353ac3>] mutex_lock_nested+0x54/0x292
       [<ffffffff810bbf72>] ? vfs_unlink+0x8b/0x128
       [<ffffffff8118179e>] ? selinux_inode_permission+0x8e/0x90
       [<ffffffff8117e271>] ? security_inode_permission+0x1c/0x1e
       [<ffffffff810bb4fb>] ? inode_permission+0x99/0xa5
       [<ffffffff810bbf72>] vfs_unlink+0x8b/0x128
       [<ffffffff810adb19>] ? kfree+0xed/0xf9
       [<ffffffffa00e3f00>] cachefiles_bury_object+0xb6/0x420 [cachefiles]
       [<ffffffff81058e21>] ? trace_hardirqs_on+0xd/0xf
       [<ffffffffa00e7e24>] ? cachefiles_check_object_xattr+0x233/0x293 [cachefiles]
       [<ffffffffa00e51b0>] cachefiles_walk_to_object+0x4ff/0x831 [cachefiles]
       [<ffffffff81032238>] ? finish_task_switch+0x0/0xb2
       [<ffffffffa00e3429>] cachefiles_lookup_object+0xac/0x12a [cachefiles]
       [<ffffffffa00741e9>] fscache_lookup_object+0x1c7/0x214 [fscache]
       [<ffffffffa0074fc5>] fscache_object_state_machine+0xa5/0x52d [fscache]
       [<ffffffffa00754ac>] fscache_object_slow_work_execute+0x5f/0xa0 [fscache]
       [<ffffffff81082093>] slow_work_execute+0x18f/0x2d1
       [<ffffffff8108239a>] slow_work_thread+0x1c5/0x308
       [<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
       [<ffffffff810821d5>] ? slow_work_thread+0x0/0x308
       [<ffffffff8104be91>] kthread+0x7a/0x82
       [<ffffffff8100beda>] child_rip+0xa/0x20
       [<ffffffff8100b87c>] ? restore_args+0x0/0x30
       [<ffffffff8104be17>] ? kthread+0x0/0x82
       [<ffffffff8100bed0>] ? child_rip+0x0/0x20
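
      Why the annotation silences the warning can be seen in a toy model of
      the check.  This is a deliberately crude sketch, not how lockdep
      works internally: the ToyLockdep class and bury() are hypothetical,
      and the only rule modelled is "complain when a lock of the same class
      and subclass is already held".

```python
I_MUTEX_NORMAL = 0
I_MUTEX_PARENT = 1


class ToyLockdep:
    def __init__(self):
        self.held = []   # (lock class, subclass) pairs currently held

    def acquire(self, lock_class, subclass=I_MUTEX_NORMAL):
        if (lock_class, subclass) in self.held:
            raise RuntimeError("possible recursive locking detected")
        self.held.append((lock_class, subclass))

    def release(self, lock_class, subclass=I_MUTEX_NORMAL):
        self.held.remove((lock_class, subclass))


def bury(lockdep, parent_subclass):
    """Caller locks the parent directory, then the unlink locks the victim."""
    lockdep.acquire("i_mutex_key", parent_subclass)
    try:
        lockdep.acquire("i_mutex_key")      # the lock taken inside the unlink
        lockdep.release("i_mutex_key")
    finally:
        lockdep.release("i_mutex_key", parent_subclass)
```

      Calling bury() with I_MUTEX_PARENT passes, while I_MUTEX_NORMAL trips
      the check, because both locks then share one class and subclass.
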
      Signed-off-by: David Howells <dhowells@redhat.com>
  16. 03 Apr, 2009 1 commit
      CacheFiles: A cache that backs onto a mounted filesystem · 9ae326a6
      David Howells authored
      Add an FS-Cache cache-backend that permits a mounted filesystem to be used as a
      backing store for the cache.
      
      CacheFiles uses a userspace daemon to do some of the cache management - such as
      reaping stale nodes and culling.  This is called cachefilesd and lives in
      /sbin.  The source for the daemon can be downloaded from:
      
      	http://people.redhat.com/~dhowells/cachefs/cachefilesd.c
      
      And an example configuration from:
      
      	http://people.redhat.com/~dhowells/cachefs/cachefilesd.conf
      
      The filesystem and data integrity of the cache are only as good as those of the
      filesystem providing the backing services.  Note that CacheFiles does not
      attempt to journal anything since the journalling interfaces of the various
      filesystems are very specific in nature.
      
      CacheFiles creates a misc character device - "/dev/cachefiles" - that is used
      to communicate with the daemon.  Only one thing may have this open at once,
      and whilst it is open, a cache is at least partially in existence.  The daemon
      opens this and sends commands down it to control the cache.
      
      CacheFiles is currently limited to a single cache.
      
      CacheFiles attempts to maintain at least a certain percentage of free space on
      the filesystem, shrinking the cache by culling the objects it contains to make
      space if necessary - see the "Cache Culling" section.  This means it can be
      placed on the same medium as a live set of data, and will expand to make use of
      spare space and automatically contract when the set of data requires more
      space.
      
      ============
      REQUIREMENTS
      ============
      
      The use of CacheFiles and its daemon requires the following features to be
      available in the system and in the cache filesystem:
      
      	- dnotify.
      
      	- extended attributes (xattrs).
      
      	- openat() and friends.
      
      	- bmap() support on files in the filesystem (FIBMAP ioctl).
      
      	- The use of bmap() to detect a partial page at the end of the file.
      
      It is strongly recommended that the "dir_index" option is enabled on Ext3
      filesystems being used as a cache.
      
      =============
      CONFIGURATION
      =============
      
      The cache is configured by a script in /etc/cachefilesd.conf.  These commands
      set up the cache ready for use.  The following script commands are available:
      
       (*) brun <N>%
       (*) bcull <N>%
       (*) bstop <N>%
       (*) frun <N>%
       (*) fcull <N>%
       (*) fstop <N>%
      
      	Configure the culling limits.  Optional.  See the section on culling.
      	The defaults are 7% (run), 5% (cull) and 1% (stop) respectively.
      
      	The commands beginning with a 'b' are file space (block) limits, those
      	beginning with an 'f' are file count limits.
      
       (*) dir <path>
      
      	Specify the directory containing the root of the cache.  Mandatory.
      
       (*) tag <name>
      
      	Specify a tag to FS-Cache to use in distinguishing multiple caches.
      	Optional.  The default is "CacheFiles".
      
       (*) debug <mask>
      
      	Specify a numeric bitmask to control debugging in the kernel module.
      	Optional.  The default is zero (all off).  The following values can be
      	OR'd into the mask to collect various information:
      
      		1	Turn on trace of function entry (_enter() macros)
      		2	Turn on trace of function exit (_leave() macros)
      		4	Turn on trace of internal debug points (_debug())
      
      	This mask can also be set through sysfs, eg:
      
      		echo 5 >/sys/module/cachefiles/parameters/debug
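
      As an illustration of the commands and defaults listed above, a
      configuration script could be read as follows.  This is a sketch, not
      the cachefilesd parser: parse_config() and debug_flags() are
      hypothetical helpers, comment lines and limit-ordering checks are not
      handled, and the path used in the usage note is only an example.

```python
LIMITS = ("brun", "bcull", "bstop", "frun", "fcull", "fstop")
DEFAULTS = {
    "brun": 7, "bcull": 5, "bstop": 1,      # block limits, in percent
    "frun": 7, "fcull": 5, "fstop": 1,      # file-count limits, in percent
    "tag": "CacheFiles",
    "debug": 0,
}
DEBUG_BITS = {1: "enter", 2: "leave", 4: "debug"}


def parse_config(text):
    """Parse configuration text into a dict, applying the defaults."""
    conf = dict(DEFAULTS)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        cmd, _, arg = line.partition(" ")
        arg = arg.strip()
        if cmd in LIMITS:
            if not arg.endswith("%"):
                raise ValueError("%s expects <N>%%" % cmd)
            conf[cmd] = int(arg[:-1])
        elif cmd in ("dir", "tag"):
            conf[cmd] = arg
        elif cmd == "debug":
            conf["debug"] = int(arg, 0)     # accepts decimal or 0x.. masks
        else:
            raise ValueError("unknown command: %s" % cmd)
    if "dir" not in conf:
        raise ValueError("'dir' is mandatory")
    return conf


def debug_flags(mask):
    """Decode the debug bitmask into the trace kinds it enables."""
    return [name for bit, name in sorted(DEBUG_BITS.items()) if mask & bit]
```

      For example, the text "dir /var/cache/fscache" followed by "debug 5"
      yields the default culling limits and enables the entry and
      internal-debug traces (1 OR'd with 4).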
      
      ==================
      STARTING THE CACHE
      ==================
      
      The cache is started by running the daemon.  The daemon opens the cache device,
      configures the cache and tells it to begin caching.  At that point the cache
      binds to fscache and the cache becomes live.
      
      The daemon is run as follows:
      
      	/sbin/cachefilesd [-d]* [-s] [-n] [-f <configfile>]
      
      The flags are:
      
       (*) -d
      
      	Increase the debugging level.  This can be specified multiple times and
      	is cumulative with itself.
      
       (*) -s
      
      	Send messages to stderr instead of syslog.
      
       (*) -n
      
      	Don't daemonise and go into background.
      
       (*) -f <configfile>
      
      	Use an alternative configuration file rather than the default one.
      
      ===============
      THINGS TO AVOID
      ===============
      
      Do not mount other things within the cache as this will cause problems.  The
      kernel module contains its own very cut-down path walking facility that ignores
      mountpoints, but the daemon can't avoid them.
      
      Do not create, rename or unlink files and directories in the cache whilst the
      cache is active, as this may cause the state to become uncertain.
      
      Renaming files in the cache might make objects appear to be other objects (the
      filename is part of the lookup key).
      
      Do not change or remove the extended attributes attached to cache files by the
      cache as this will cause the cache state management to get confused.
      
      Do not create files or directories in the cache, lest the cache get confused or
      serve incorrect data.
      
      Do not chmod files in the cache.  The module creates things with minimal
      permissions to prevent random users being able to access them directly.
      
      =============
      CACHE CULLING
      =============
      
      The cache may need culling occasionally to make space.  This involves
      discarding objects from the cache that have been used less recently than
      anything else.  Culling is based on the access time of data objects.  Empty
      directories are culled if not in use.
      
      Cache culling is done on the basis of the percentage of blocks and the
      percentage of files available in the underlying filesystem.  There are six
      "limits":
      
       (*) brun
       (*) frun
      
           If the amount of available space and the number of available files in
           the cache both rise above these limits, then culling is turned off.
      
       (*) bcull
       (*) fcull
      
           If the amount of available space or the number of available files in the
           cache falls below either of these limits, then culling is started.
      
       (*) bstop
       (*) fstop
      
           If the amount of available space or the number of available files in the
           cache falls below either of these limits, then no further allocation of
           disk space or files is permitted until culling has raised things above
           these limits again.
      
      These must be configured thusly:
      
      	0 <= bstop < bcull < brun < 100
      	0 <= fstop < fcull < frun < 100
      
      Note that these are percentages of available space and available files, and do
      _not_ appear as 100 minus the percentage displayed by the "df" program.
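
      The interaction of the six limits can be sketched in Python as follows.
      This is an illustrative model only, not the kernel module's logic: the
      names check_limits() and next_state() are hypothetical, and the handling
      of values exactly equal to a limit is an assumption.

```python
def check_limits(stop, cull, run):
    """Return True if 0 <= stop < cull < run < 100 holds."""
    return 0 <= stop < cull < run < 100


def next_state(culling, bavail, favail, cfg):
    """Return (culling, allocation_allowed).

    bavail and favail are the percentages of available blocks and files;
    cfg maps the six limit names to percentages.
    """
    if bavail < cfg["bcull"] or favail < cfg["fcull"]:
        culling = True            # either value fell below its cull limit
    elif bavail > cfg["brun"] and favail > cfg["frun"]:
        culling = False           # both values rose above their run limits
    # Between cull and run the previous culling state is kept.
    allowed = not (bavail < cfg["bstop"] or favail < cfg["fstop"])
    return culling, allowed
```

      With the default limits (7, 5 and 1), a cache that has dropped to 4%
      available blocks starts culling and keeps culling at 6%, stopping only
      once both values exceed 7%.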
      
      The userspace daemon scans the cache to build up a table of cullable objects.
      These are then culled in least recently used order.  A new scan of the cache is
      started as soon as space is made in the table.  Objects will be skipped if
      their atimes have changed or if the kernel module says it is still using them.
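
      The scan-and-cull cycle described above might be modelled like this in
      Python.  It is a sketch under stated assumptions rather than the daemon's
      implementation: build_cull_table() and cull() are hypothetical names, and
      the current_atime and in_use callables stand in for a fresh stat of the
      object and for the query to the kernel module.

```python
import heapq


def build_cull_table(objects, table_size):
    """Keep the table_size least recently used objects, oldest first.

    objects is an iterable of (name, atime) pairs found by the scan.
    """
    return heapq.nsmallest(table_size, objects, key=lambda obj: obj[1])


def cull(table, current_atime, in_use):
    """Return the names that would be culled, in least recently used order."""
    culled = []
    for name, atime in table:
        if current_atime(name) != atime:
            continue              # atime changed since the scan: skip
        if in_use(name):
            continue              # still in use by the kernel module: skip
        culled.append(name)
    return culled
```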
      
      ===============
      CACHE STRUCTURE
      ===============
      
      The CacheFiles module will create two directories in the directory it was
      given:
      
       (*) cache/
      
       (*) graveyard/
      
      The active cache objects all reside in the first directory.  The CacheFiles
      kernel module moves any retired or culled objects that it can't simply unlink
      to the graveyard from which the daemon will actually delete them.
      
      The daemon uses dnotify to monitor the graveyard directory, and will delete
      anything that appears therein.
      
      The module represents index objects as directories with the filename "I..." or
      "J...".  Note that the "cache/" directory is itself a special index.
      
      Data objects are represented as files if they have no children, or directories
      if they do.  Their filenames all begin "D..." or "E...".  If represented as a
      directory, data objects will have a file in the directory called "data" that
      actually holds the data.
      
      Special objects are similar to data objects, except their filenames begin
      "S..." or "T...".
      
      If an object has children, then it will be represented as a directory.
      Immediately in the representative directory are a collection of directories
      named for hash values of the child object keys with an '@' prepended.  Into
      this directory, if possible, will be placed the representations of the child
      objects:
      
      	INDEX     INDEX      INDEX                             DATA FILES
      	========= ========== ================================= ================
      	cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400
      	cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...DB1ry
      	cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...N22ry
      	cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...FP1ry
      
      If the key is so long that it exceeds NAME_MAX with the decorations added on to
      it, then it will be cut into pieces, the first few of which will be used to
      make a nest of directories, and the last one of which will be the object
      inside the last directory.  The names of the intermediate directories will have
      '+' prepended:
      
      	J1223/@23/+xy...z/+kl...m/Epqr
      
      Note that keys are raw data, and not only may they exceed NAME_MAX in size,
      they may also contain things like '/' and NUL characters, and so they may not
      be suitable for turning directly into a filename.
      
      To handle this, CacheFiles will use a suitably printable filename directly and
      "base-64" encode ones that aren't directly suitable.  The two versions of
      object filenames indicate the encoding:
      
      	OBJECT TYPE	PRINTABLE	ENCODED
      	===============	===============	===============
      	Index		"I..."		"J..."
      	Data		"D..."		"E..."
      	Special		"S..."		"T..."
      
      Intermediate directories are always "@" or "+" as appropriate.
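
      The naming scheme in the table above can be expressed as a small lookup
      on the first character of a name.  The following Python sketch is
      illustrative only: classify() is a hypothetical helper, and the labels
      "hash" and "key continuation" are descriptive terms chosen for this
      example.

```python
# First character of a name -> (object type, encoding or role).
PREFIXES = {
    "I": ("index", "printable"),
    "J": ("index", "encoded"),
    "D": ("data", "printable"),
    "E": ("data", "encoded"),
    "S": ("special", "printable"),
    "T": ("special", "encoded"),
    "@": ("intermediate", "hash"),
    "+": ("intermediate", "key continuation"),
}


def classify(name):
    """Classify one path component found below the cache/ directory."""
    if not name or name[0] not in PREFIXES:
        raise ValueError(f"unrecognised cache filename: {name!r}")
    return PREFIXES[name[0]]
```

      Applied to the components of the example paths shown earlier, "@4a" is an
      intermediate hash directory, "I03nfs" is a printable index and
      "Ji000000000000000--fHg8hi8400" is an encoded index.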
      
      Each object in the cache has an extended attribute label that holds the object
      type ID (required to distinguish special objects) and the auxiliary data from
      the netfs.  The latter is used to detect stale objects in the cache and update
      or retire them.
      
      Note that CacheFiles will erase from the cache any file it doesn't recognise or
      any file of an incorrect type (such as a FIFO file or a device file).
      
      ==========================
      SECURITY MODEL AND SELINUX
      ==========================
      
      CacheFiles is implemented to deal properly with the LSM security features of
      the Linux kernel and the SELinux facility.
      
      One of the problems that CacheFiles faces is that it is generally acting on
      behalf of a process, and running in that process's context, and that includes a
      security context that is not appropriate for accessing the cache - either
      because the files in the cache are inaccessible to that process, or because if
      the process creates a file in the cache, that file may be inaccessible to other
      processes.
      
      The way CacheFiles works is to temporarily change the security context (fsuid,
      fsgid and actor security label) that the process acts as - without changing the
      security context of the process when it is the target of an operation performed by
      some other process (so signalling and suchlike still work correctly).
      
      When the CacheFiles module is asked to bind to its cache, it:
      
       (1) Finds the security label attached to the root cache directory and uses
           that as the security label with which it will create files.  By default,
           this is:
      
      	cachefiles_var_t
      
       (2) Finds the security label of the process which issued the bind request
           (presumed to be the cachefilesd daemon), which by default will be:
      
      	cachefilesd_t
      
           and asks LSM to supply a security ID as which it should act given the
           daemon's label.  By default, this will be:
      
      	cachefiles_kernel_t
      
           SELinux transitions the daemon's security ID to the module's security ID
           based on a rule of this form in the policy.
      
      	type_transition <daemon's-ID> kernel_t : process <module's-ID>;
      
           For instance:
      
      	type_transition cachefilesd_t kernel_t : process cachefiles_kernel_t;
      
      The module's security ID gives it permission to create, move and remove files
      and directories in the cache, to find and access directories and files in the
      cache, to set and access extended attributes on cache objects, and to read and
      write files in the cache.
      
      The daemon's security ID gives it only a very restricted set of permissions: it
      may scan directories, stat files and erase files and directories.  It may
      not read or write files in the cache, and so it is precluded from accessing the
      data cached therein; nor is it permitted to create new files in the cache.
      
      There are policy source files available in:
      
      	http://people.redhat.com/~dhowells/fscache/cachefilesd-0.8.tar.bz2
      
      and later versions.  In that tarball, see the files:
      
      	cachefilesd.te
      	cachefilesd.fc
      	cachefilesd.if
      
      They are built and installed directly by the RPM.
      
      If a non-RPM based system is being used, then copy the above files to their own
      directory and run:
      
      	make -f /usr/share/selinux/devel/Makefile
      	semodule -i cachefilesd.pp
      
      You will need checkpolicy and selinux-policy-devel installed prior to the
      build.
      
      By default, the cache is located in /var/fscache, but if it is desirable that
      it should be elsewhere, then either the above policy files must be altered, or
      an auxiliary policy must be installed to label the alternate location of the
      cache.
      
      For instructions on how to add an auxiliary policy to enable the cache to be
      located elsewhere when SELinux is in enforcing mode, please see:
      
      	/usr/share/doc/cachefilesd-*/move-cache.txt
      
      This file is available when the cachefilesd rpm is installed; alternatively,
      the document can be found in the sources.
      
      ==================
      A NOTE ON SECURITY
      ==================
      
      CacheFiles makes use of the split security in the task_struct.  It allocates
      its own task_security structure, and redirects current->act_as to point to it
      when it acts on behalf of another process, in that process's context.
      
      The reason it does this is that it calls vfs_mkdir() and suchlike rather than
      bypassing security and calling inode ops directly.  Therefore the VFS and LSM
      may deny the CacheFiles access to the cache data because under some
      circumstances the caching code is running in the security context of whatever
      process issued the original syscall on the netfs.
      
      Furthermore, should CacheFiles create a file or directory, the security
      parameters with which that object is created (UID, GID, security label) would
      be derived from the process that issued the system call, thus potentially
      preventing other processes from accessing the cache - including CacheFiles's
      cache management daemon (cachefilesd).
      
      What is required is to temporarily override the security of the process that
      issued the system call.  We can't, however, just do an in-place change of the
      security data as that affects the process as an object, not just as a subject.
      This means it may lose signals or ptrace events for example, and affects what
      the process looks like in /proc.
      
      So CacheFiles makes use of a logical split in the security between the
      objective security (task->sec) and the subjective security (task->act_as).  The
      objective security holds the intrinsic security properties of a process and is
      never overridden.  This is what appears in /proc, and is what is used when a
      process is the target of an operation by some other process (SIGKILL for
      example).
      
      The subjective security holds the active security properties of a process, and
      may be overridden.  This is not seen externally, and is used when a process
      acts upon another object, for example SIGKILLing another process or opening a
      file.
      
      LSM hooks exist that allow SELinux (or Smack or whatever) to reject a request
      for CacheFiles to run in a context of a specific security label, or to create
      files and directories with another security label.
      
      This documentation is added by the patch to:
      
      	Documentation/filesystems/caching/cachefiles.txt
      Signed-Off-By: David Howells <dhowells@redhat.com>
      Acked-by: Steve Dickson <steved@redhat.com>
      Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      Acked-by: Al Viro <viro@zeniv.linux.org.uk>
      Tested-by: Daire Byrne <Daire.Byrne@framestore.com>
      9ae326a6