  1. Jun 13, 2008
    • radix-tree: fix small lockless radix-tree bug · 643b52b9
      Nick Piggin authored
      
      We shrink a radix tree when its root node has only one child, in the left
      most slot.  The child becomes the new root node.  To perform this
      operation in a manner compatible with concurrent lockless lookups, we
      atomically switch the root pointer from the parent to its child.
      
      However a concurrent lockless lookup may now have loaded a pointer to the
      parent (and is presently deciding what to do next).  For this reason, we
      also have to keep the parent node in a valid state after shrinking the
      tree, until the next RCU grace period -- otherwise this lookup with the
      parent pointer may not do the right thing.  Notably, we need to keep the
      child in the left most slot there in case that is requested by the lookup.
      
      This is all pretty standard RCU stuff.  It is worth repeating because in
      my eagerness to obey the radix tree node constructor scheme, I had broken
      it by zeroing the radix tree node before the grace period.
      
      What could happen is that a lookup can load the parent pointer, then
      decide it wants to follow the left most child slot, only to find the slot
      contained NULL due to the concurrent shrinker having zeroed the parent
      node before waiting for a grace period.  The lookup would return a false
      negative as a result.
      
      Fix it by doing that clearing in the RCU callback.  I would normally want
      to rip out the constructor entirely, but radix tree nodes are one of those
      places where they make sense (only few cachelines will be touched soon
      after allocation).
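
      A hedged sketch of what the fix looks like (names follow lib/radix-tree.c,
      but the details here are illustrative rather than the exact diff):
      radix_tree_shrink() leaves the old root untouched, with the child still in
      slot 0, and hands it to call_rcu(); only the RCU callback, which runs after
      every pre-existing lookup has finished, re-zeroes the node so it can go
      back to the constructor-managed slab.

          static void radix_tree_node_rcu_free(struct rcu_head *head)
          {
                  struct radix_tree_node *node =
                          container_of(head, struct radix_tree_node, rcu_head);

                  /*
                   * The slab constructor expects zeroed objects, so clear the
                   * left most slot and the count here -- after the grace
                   * period -- rather than in radix_tree_shrink(), where a
                   * concurrent lookup could still be walking the stale parent
                   * and reading slot 0.
                   */
                  node->slots[0] = NULL;
                  node->count = 0;

                  kmem_cache_free(radix_tree_node_cachep, node);
          }

      radix_tree_shrink() itself then only switches the root with
      rcu_assign_pointer() and queues the old node via
      call_rcu(&node->rcu_head, radix_tree_node_rcu_free).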
      
      This was never actually found in any lockless pagecache testing or by the
      test harness, but by seeing the odd problem with my scalable vmap rewrite.
       I have not tickled the test harness into reproducing it yet, but I'll
      keep working at it.
      
      Fortunately, it is not a problem anywhere lockless pagecache is used in
      mainline kernels (pagecache probe is not a guarantee, and brd does not
      have concurrent lookups and deletes).
      
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. Jun 12, 2008
    • add an inlined version of iter_div_u64_rem · d5e181f7
      Jeremy Fitzhardinge authored
      
      iter_div_u64_rem is used in the x86-64 vdso, which cannot call other
      kernel code.  For this case, provide the always_inlined version,
      __iter_div_u64_rem.
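
      As a rough sketch of the resulting split (the real header is
      include/linux/math64.h; the EXPORT_SYMBOL and exact file layout are
      assumptions, and the loop body itself appears under the next entry): the
      vdso calls the __always_inline variant directly, while ordinary kernel
      code keeps using a small out-of-line wrapper.

          #include <linux/module.h>
          #include <linux/math64.h>   /* declares __iter_div_u64_rem() */

          /*
           * Out-of-line wrapper for normal kernel code; the x86-64 vdso, which
           * must not call back into the kernel image, uses
           * __iter_div_u64_rem() directly from the header.
           */
          u32 iter_div_u64_rem(u64 dividend, u32 divisor, u64 *remainder)
          {
                  return __iter_div_u64_rem(dividend, divisor, remainder);
          }
          EXPORT_SYMBOL(iter_div_u64_rem);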
      
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • common implementation of iterative div/mod · f595ec96
      Jeremy Fitzhardinge authored
      
      We have a few instances of the open-coded iterative div/mod loop, used
      when we don't expect the dividend to be much bigger than the divisor.
      Unfortunately, modern gccs tend to strength-"reduce" this into a full
      mod operation, which isn't necessarily any faster, and which isn't even
      available if gcc implements it as a libgcc call.
      
      The workaround is to put a dummy asm statement in the loop to prevent
      gcc from performing the transformation.
      
      This patch creates a single implementation of this loop, and uses it
      to replace the open-coded versions I know about.
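
      A sketch of the shared helper this describes, assuming the lib/div64.c
      placement (the previous entry then moves this body into an __always_inline
      header variant); the empty asm() with 'dividend' as an in/out operand is
      the dummy statement that stops gcc from collapsing the loop:

          #include <linux/types.h>

          u32 iter_div_u64_rem(u64 dividend, u32 divisor, u64 *remainder)
          {
                  u32 ret = 0;

                  while (dividend >= divisor) {
                          /*
                           * The asm() makes 'dividend' look like it is read and
                           * written by opaque code, so gcc cannot
                           * strength-"reduce" the loop into a real 64-bit
                           * modulo (possibly a libgcc call).
                           */
                          asm("" : "+rm"(dividend));

                          dividend -= divisor;
                          ret++;
                  }

                  *remainder = dividend;
                  return ret;
          }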
      
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: john stultz <johnstul@us.ibm.com>
      Cc: Segher Boessenkool <segher@kernel.crashing.org>
      Cc: Christian Kujau <lists@nerdbynature.de>
      Cc: Robert Hancock <hancockr@shaw.ca>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  3. Jun 06, 2008
  4. May 19, 2008
  5. May 15, 2008
  6. May 13, 2008
  7. May 12, 2008
  8. May 11, 2008
    • BKL: revert back to the old spinlock implementation · 8e3e076c
      Linus Torvalds authored
      
      The generic semaphore rewrite had a huge performance regression on AIM7
      (and potentially other BKL-heavy benchmarks) because the generic
      semaphores had been rewritten to be simple to understand and fair.  The
      latter, in particular, turns a semaphore-based BKL implementation into a
      mess of scheduling.
      
      The attempt to fix the performance regression failed miserably (see the
      previous commit 00b41ec2 'Revert
      "semaphore: fix"'), and so for now the simple and sane approach is to
      instead just go back to the old spinlock-based BKL implementation that
      never had any issues like this.
      
      This patch also has the advantage of being reported, by Yanmin Zhang,
      to fix the regression completely, unlike the semaphore hack, which
      still left a regression of a couple of percentage points.
      
      As a spinlock, the BKL obviously has the potential to be a latency
      issue, but it's not really any different from any other spinlock in that
      respect.  We do want to get rid of the BKL asap, but that has been the
      plan for several years.
      
      These days, the biggest users are in the tty layer (open/release in
      particular) and Alan holds out some hope:
      
        "tty release is probably a few months away from getting cured - I'm
         afraid it will almost certainly be the very last user of the BKL in
         tty to get fixed as it depends on everything else being sanely locked."
      
      so while we're not there yet, we do have a plan of action.
      
      Tested-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Alexander Viro <viro@ftp.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. May 05, 2008
  10. May 01, 2008
  11. Apr 30, 2008
  12. Apr 29, 2008
  13. Apr 28, 2008
    • mempolicy: add bitmap_onto() and bitmap_fold() operations · 7ea931c9
      Paul Jackson authored
      
      The following adds two more bitmap operators, bitmap_onto() and bitmap_fold(),
      with the usual cpumask and nodemask wrappers.
      
      The bitmap_onto() operator computes one bitmap relative to another.  If the
      n-th bit in the origin mask is set, then the m-th bit of the destination mask
      will be set, where m is the position of the n-th set bit in the relative mask.
      
      The bitmap_fold() operator folds a bitmap into a second that has bit m set iff
      the input bitmap has some bit n set, where m == n mod sz, for the specified sz
      value.
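
      A hedged, user-space illustration of the semantics only -- not the
      kernel's implementation, which operates on packed unsigned long words in
      lib/bitmap.c; bits are modelled as plain int arrays here for clarity:

          static void sketch_bitmap_onto(int *dst, const int *orig,
                                         const int *relmap, int bits)
          {
                  int m, n = 0;

                  for (m = 0; m < bits; m++)
                          dst[m] = 0;

                  /* Walk the set bits of relmap; the n-th one sits at position m. */
                  for (m = 0; m < bits; m++) {
                          if (!relmap[m])
                                  continue;
                          if (orig[n])    /* origin bit n selects relmap's n-th set bit */
                                  dst[m] = 1;
                          n++;
                  }
          }

          static void sketch_bitmap_fold(int *dst, const int *orig, int sz, int bits)
          {
                  int n;

                  for (n = 0; n < bits; n++)
                          dst[n] = 0;
                  for (n = 0; n < bits; n++)
                          if (orig[n])
                                  dst[n % sz] = 1;    /* bit n folds onto n mod sz */
          }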
      
      There are two substantive changes between this patch and its
      predecessor bitmap_relative:
       1) Renamed bitmap_relative() to be bitmap_onto().
       2) Added bitmap_fold().
      
      The essential motivation for bitmap_onto() is to provide a mechanism for
      converting a cpuset-relative CPU or Node mask to an absolute mask.  Cpuset
      relative masks are written as if the current task were in a cpuset whose CPUs
      or Nodes were just the consecutive ones numbered 0..N-1, for some N.  The
      bitmap_onto() operator is provided in anticipation of adding support for the
      first such cpuset relative mask, by the mbind() and set_mempolicy() system
      calls, using a planned flag of MPOL_F_RELATIVE_NODES.  These bitmap operators
      (and their nodemask wrappers, in particular) will be used in code that
      converts the user specified cpuset relative memory policy to a specific system
       node numbered policy, given the current mems_allowed of the task's cpuset.
      
      Such cpuset relative mempolicies will address two deficiencies
      of the existing interface between cpusets and mempolicies:
        1) A task cannot at present reliably establish a cpuset
           relative mempolicy, because there is an essential race
           condition: the task's cpuset may be changed between the
           time the task queries its cpuset placement and the time
           it issues the applicable mbind or set_mempolicy system
           call.
       2) A task cannot at present establish what cpuset relative
          mempolicy it would like to have, if it is in a smaller
          cpuset than it might have mempolicy preferences for,
          because the existing interface only allows specifying
          mempolicies for nodes currently allowed by the cpuset.
      
      Cpuset relative mempolicies are useful for tasks that don't particularly
      distinguish one CPU or Node from another, but care only about how many of
      each are allowed and about the proper placement of threads and memory
      pages across the CPUs and Nodes available.
      
      The motivation for the added bitmap_fold() can be seen in the following
      example.
      
      Let's say an application has specified some mempolicies that presume 16 memory
      nodes, including say a mempolicy that specified MPOL_F_RELATIVE_NODES (cpuset
      relative) nodes 12-15.  Then let's say that application is crammed into a
      cpuset that only has 8 memory nodes, 0-7.  If one just uses bitmap_onto(),
      this mempolicy, mapped to that cpuset, would ignore the requested relative
      nodes above 7, leaving it empty of nodes.  That's not good; better to fold the
      higher nodes down, so that some nodes are included in the resulting mapped
      mempolicy.  In this case, the mempolicy nodes 12-15 are taken modulo 8 (the
      weight of the mems_allowed of the confining cpuset), resulting in a mempolicy
      specifying nodes 4-7.
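
      A hedged fragment showing how the two calls might be combined for that
      case (the variable names are made up; 'mems_allowed' stands for a bitmap
      with the cpuset's nodes 0-7 set):

          DECLARE_BITMAP(request, MAX_NUMNODES);  /* relative request: bits 12-15 set */
          DECLARE_BITMAP(folded,  MAX_NUMNODES);
          DECLARE_BITMAP(result,  MAX_NUMNODES);

          /* Fold first: bits 12-15 land on 4-7, taken modulo the weight (8)... */
          bitmap_fold(folded, request,
                      bitmap_weight(mems_allowed, MAX_NUMNODES), MAX_NUMNODES);

          /* ...then map onto the allowed nodes: bit m of 'result' is set when
           * 'folded' selects the m-th set bit of mems_allowed, giving nodes 4-7. */
          bitmap_onto(result, folded, mems_allowed, MAX_NUMNODES);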
      
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: <kosaki.motohiro@jp.fujitsu.com>
      Cc: <ray-lk@madrabbit.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Remove set_migrateflags() · 488514d1
      Christoph Lameter authored
      
      Migrate flags must be set on slab creation as agreed upon when the antifrag
      logic was reviewed.  Otherwise some slabs of a slabcache will end up in the
      unmovable and others in the reclaimable section depending on which flag was
      active when a new slab page was allocated.
      
      This likely slid in somehow when antifrag was merged. Remove it.
      
      The buffer_heads are always allocated with __GFP_RECLAIMABLE because the
      SLAB_RECLAIM_ACCOUNT option is set.  The set_migrateflags() never had any
      effect there.
      
      Radix tree allocations are not directly reclaimable but they are allocated
      with __GFP_RECLAIMABLE set on each allocation.  We now set
      SLAB_RECLAIM_ACCOUNT on radix tree slab creation making sure that radix
      tree slabs are consistently placed in the reclaimable section.  Radix tree
      slabs will also be accounted as such.
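
      A hedged sketch of the kind of change this implies for the radix tree
      (flags other than SLAB_RECLAIM_ACCOUNT are shown only as plausible
      context): mark the node cache reclaim-accounted at creation time instead
      of ORing __GFP_RECLAIMABLE into every allocation via set_migrateflags().

          radix_tree_node_cachep = kmem_cache_create("radix_tree_node",
                          sizeof(struct radix_tree_node), 0,
                          SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
                          radix_tree_node_ctor);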
      
      That leaves no users of set_migrateflags(), so remove it.
      
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. Apr 26, 2008
    • x86, bitops: select the generic bitmap search functions · 19870def
      Alexander van Heukelum authored
      
      Introduce GENERIC_FIND_FIRST_BIT and GENERIC_FIND_NEXT_BIT in
      lib/Kconfig, defaulting to off. An arch that wants to use the
      generic implementation now only has to use a select statement
      to include them.
      
      I added an always-y option (X86_CPU) to arch/x86/Kconfig.cpu
      and used that to select the generic search functions. This
      way ARCH=um SUBARCH=i386 automatically picks up the change
      too, and arch/um/Kconfig.i386 can therefore be simplified a
      bit. ARCH=um SUBARCH=x86_64 does things differently, but
      still compiles fine. It seems that a "def_bool y" always
      wins over a "def_bool n"?
      
      Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • x86: generic versions of find_first_(zero_)bit, convert i386 · 77b9bd9c
      Alexander van Heukelum authored
      
      Generic versions of __find_first_bit and __find_first_zero_bit
      are introduced as simplified versions of __find_next_bit and
      __find_next_zero_bit. Their compilation and use are guarded by
      a new config variable GENERIC_FIND_FIRST_BIT.
      
      The generic versions of find_first_bit and find_first_zero_bit
      are implemented in terms of the newly introduced __find_first_bit
      and __find_first_zero_bit.
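
      A sketch of what the generic __find_first_bit() can look like (close in
      spirit to lib/find_next_bit.c, though details may differ): scan whole
      words until a non-zero one turns up, then pick the lowest set bit with
      __ffs(); find_first_bit() is then only a thin wrapper around it.

          #include <linux/bitops.h>

          unsigned long __find_first_bit(const unsigned long *addr,
                                         unsigned long size)
          {
                  const unsigned long *p = addr;
                  unsigned long result = 0;
                  unsigned long tmp;

                  /* Scan full words first. */
                  while (size & ~(BITS_PER_LONG - 1)) {
                          if ((tmp = *(p++)))
                                  goto found;
                          result += BITS_PER_LONG;
                          size -= BITS_PER_LONG;
                  }
                  if (!size)
                          return result;

                  /* Partial last word: mask off the bits beyond 'size'. */
                  tmp = *p & (~0UL >> (BITS_PER_LONG - size));
                  if (!tmp)
                          return result + size;   /* no bit set at all */
          found:
                  return result + __ffs(tmp);
          }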
      
      This patch does not remove the i386-specific implementation,
      but it does switch i386 to use the generic functions by setting
      GENERIC_FIND_FIRST_BIT=y for X86_32.
      
      Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>