1. 13 Nov, 2008 3 commits
  2. 21 Oct, 2008 1 commit
  3. 20 Oct, 2008 1 commit
    • Lee Schermerhorn's avatar
      SHM_LOCKED pages are unevictable · 89e004ea
      Lee Schermerhorn authored
      Shmem segments locked into memory via shmctl(SHM_LOCKED) should not be
      kept on the normal LRU, since scanning them is a waste of time and might
      throw off kswapd's balancing algorithms.  Place them on the unevictable
      LRU list instead.
      
      Use the AS_UNEVICTABLE flag to mark address_space of SHM_LOCKed shared
      memory regions as unevictable.  Then these pages will be culled off the
      normal LRU lists during vmscan.
      
      Add new wrapper function to clear the mapping's unevictable state when/if
      shared memory segment is munlocked.
      
      Add 'scan_mapping_unevictable_page()' to mm/vmscan.c to scan all pages in
      the shmem segment's mapping [struct address_space] for evictability now
      that they're no longer locked.  If so, move them to the appropriate zone
      lru list.
      
      Changes depend on [CONFIG_]UNEVICTABLE_LRU.
      
      [kosaki.motohiro@jp.fujitsu.com: revert shm change]
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarKosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      89e004ea
  4. 25 Jul, 2008 1 commit
  5. 24 Jul, 2008 1 commit
  6. 13 Jun, 2008 1 commit
    • Paul Menage's avatar
      /proc/sysvipc/shm: fix 32-bit truncation of segment sizes · 6c826818
      Paul Menage authored
      sysvipc_shm_proc_show() picks between format strings (based on the
      expected maximum length of a SHM segment) in a way that prevents gcc from
      performing format checks on the seq_printf() parameters.  This hid two
      format errors - shp->shm_segsz and shp->shm_nattach are both unsigned
      long, but were being printed as unsigned int and signed int respectively.
      This leads to 32-bit truncation of SHM segment sizes reported in
      /proc/sysvipc/shm.  (And for nattach, but that's less of a problem for
      most users).
      
      This patch makes the format string directly visible to gcc's format
      specifier checker, and fixes the two broken format specifiers.
      Signed-off-by: default avatarPaul Menage <menage@google.com>
      Cc: Nadia Derbey <Nadia.Derbey@bull.net>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Pierre Peiffer <peifferp@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6c826818
  7. 10 Jun, 2008 1 commit
  8. 29 Apr, 2008 5 commits
  9. 28 Apr, 2008 2 commits
    • Lee Schermerhorn's avatar
      mempolicy: rework mempolicy Reference Counting [yet again] · 52cd3b07
      Lee Schermerhorn authored
      After further discussion with Christoph Lameter, it has become clear that my
      earlier attempts to clean up the mempolicy reference counting were a bit of
      overkill in some areas, resulting in superflous ref/unref in what are usually
      fast paths.  In other areas, further inspection reveals that I botched the
      unref for interleave policies.
      
      A separate patch, suitable for upstream/stable trees, fixes up the known
      errors in the previous attempt to fix reference counting.
      
      This patch reworks the memory policy referencing counting and, one hopes,
      simplifies the code.  Maybe I'll get it right this time.
      
      See the update to the numa_memory_policy.txt document for a discussion of
      memory policy reference counting that motivates this patch.
      
      Summary:
      
      Lookup of mempolicy, based on (vma, address) need only add a reference for
      shared policy, and we need only unref the policy when finished for shared
      policies.  So, this patch backs out all of the unneeded extra reference
      counting added by my previous attempt.  It then unrefs only shared policies
      when we're finished with them, using the mpol_cond_put() [conditional put]
      helper function introduced by this patch.
      
      Note that shmem_swapin() calls read_swap_cache_async() with a dummy vma
      containing just the policy.  read_swap_cache_async() can call alloc_page_vma()
      multiple times, so we can't let alloc_page_vma() unref the shared policy in
      this case.  To avoid this, we make a copy of any non-null shared policy and
      remove the MPOL_F_SHARED flag from the copy.  This copy occurs before reading
      a page [or multiple pages] from swap, so the overhead should not be an issue
      here.
      
      I introduced a new static inline function "mpol_cond_copy()" to copy the
      shared policy to an on-stack policy and remove the flags that would require a
      conditional free.  The current implementation of mpol_cond_copy() assumes that
      the struct mempolicy contains no pointers to dynamically allocated structures
      that must be duplicated or reference counted during copy.
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      52cd3b07
    • Lee Schermerhorn's avatar
      mempolicy: fixup Fallback for Default Shmem Policy · ae4d8c16
      Lee Schermerhorn authored
      get_vma_policy() is not handling fallback to task policy correctly when the
      get_policy() vm_op returns NULL.  The NULL overwrites the 'pol' variable that
      was holding the fallback task mempolicy.  So, it was falling back directly to
      system default policy.
      
      Fix get_vma_policy() to use only non-NULL policy returned from the vma
      get_policy op.
      
      shm_get_policy() was falling back to current task's mempolicy if the "backing
      file system" [tmpfs vs hugetlbfs] does not support the get_policy vm_op and
      the vma policy is null.  This is incorrect for show_numa_maps() which is
      likely querying the numa_maps of some task other than current.  Remove this
      fallback.
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ae4d8c16
  10. 11 Mar, 2008 1 commit
    • Lee Schermerhorn's avatar
      mempolicy: fix reference counting bugs · 69682d85
      Lee Schermerhorn authored
      Address 3 known bugs in the current memory policy reference counting method.
      I have a series of patches to rework the reference counting to reduce overhead
      in the allocation path.  However, that series will require testing in -mm once
      I repost it.
      
      1) alloc_page_vma() does not release the extra reference taken for
         vma/shared mempolicy when the mode == MPOL_INTERLEAVE.  This can result in
         leaking mempolicy structures.  This is probably occurring, but not being
         noticed.
      
         Fix:  add the conditional release of the reference.
      
      2) hugezonelist unconditionally releases a reference on the mempolicy when
         mode == MPOL_INTERLEAVE.  This can result in decrementing the reference
         count for system default policy [should have no ill effect] or premature
         freeing of task policy.  If this occurred, the next allocation using task
         mempolicy would use the freed structure and probably BUG out.
      
         Fix:  add the necessary check to the release.
      
      3) The current reference counting method assumes that vma 'get_policy()'
         methods automatically add an extra reference a non-NULL returned mempolicy.
          This is true for shmem_get_policy() used by tmpfs mappings, including
         regular page shm segments.  However, SHM_HUGETLB shm's, backed by
         hugetlbfs, just use the vma policy without the extra reference.  This
         results in freeing of the vma policy on the first allocation, with reuse of
         the freed mempolicy structure on subsequent allocations.
      
         Fix: Rather than add another condition to the conditional reference
         release, which occur in the allocation path, just add a reference when
         returning the vma policy in shm_get_policy() to match the assumptions.
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <eric.whitney@hp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      69682d85
  11. 08 Feb, 2008 3 commits
    • Pierre Peiffer's avatar
      IPC: consolidate sem_exit_ns(), msg_exit_ns() and shm_exit_ns() · 01b8b07a
      Pierre Peiffer authored
      sem_exit_ns(), msg_exit_ns() and shm_exit_ns() are all called when an
      ipc_namespace is released to free all ipcs of each type.  But in fact, they
      do the same thing: they loop around all ipcs to free them individually by
      calling a specific routine.
      
      This patch proposes to consolidate this by introducing a common function,
      free_ipcs(), that do the job.  The specific routine to call on each
      individual ipcs is passed as parameter.  For this, these ipc-specific
      'free' routines are reworked to take a generic 'struct ipc_perm' as
      parameter.
      Signed-off-by: default avatarPierre Peiffer <pierre.peiffer@bull.net>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Nadia Derbey <Nadia.Derbey@bull.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      01b8b07a
    • Pierre Peiffer's avatar
      IPC: make struct ipc_ids static in ipc_namespace · ed2ddbf8
      Pierre Peiffer authored
      Each ipc_namespace contains a table of 3 pointers to struct ipc_ids (3 for
      msg, sem and shm, structure used to store all ipcs) These 'struct ipc_ids'
      are dynamically allocated for each icp_namespace as the ipc_namespace
      itself (for the init namespace, they are initialized with pointers to
      static variables instead)
      
      It is so for historical reason: in fact, before the use of idr to store the
      ipcs, the ipcs were stored in tables of variable length, depending of the
      maximum number of ipc allowed.  Now, these 'struct ipc_ids' have a fixed
      size.  As they are allocated in any cases for each new ipc_namespace, there
      is no gain of memory in having them allocated separately of the struct
      ipc_namespace.
      
      This patch proposes to make this table static in the struct ipc_namespace.
      Thus, we can allocate all in once and get rid of all the code needed to
      allocate and free these ipc_ids separately.
      Signed-off-by: default avatarPierre Peiffer <pierre.peiffer@bull.net>
      Acked-by: default avatarCedric Le Goater <clg@fr.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Nadia Derbey <Nadia.Derbey@bull.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ed2ddbf8
    • Pavel Emelyanov's avatar
      namespaces: move the IPC namespace under IPC_NS option · ae5e1b22
      Pavel Emelyanov authored
      Currently the IPC namespace management code is spread over the ipc/*.c files.
      I moved this code into ipc/namespace.c file which is compiled out when needed.
      
      The linux/ipc_namespace.h file is used to store the prototypes of the
      functions in namespace.c and the stubs for NAMESPACES=n case.  This is done
      so, because the stub for copy_ipc_namespace requires the knowledge of the
      CLONE_NEWIPC flag, which is in sched.h.  But the linux/ipc.h file itself in
      included into many many .c files via the sys.h->sem.h sequence so adding the
      sched.h into it will make all these .c depend on sched.h which is not that
      good.  On the other hand the knowledge about the namespaces stuff is required
      in 4 .c files only.
      
      Besides, this patch compiles out some auxiliary functions from ipc/sem.c,
      msg.c and shm.c files.  It turned out that moving these functions into
      namespaces.c is not that easy because they use many other calls and macros
      from the original file.  Moving them would make this patch complicated.  On
      the other hand all these functions can be consolidated, so I will send a
      separate patch doing this a bit later.
      Signed-off-by: default avatarPavel Emelyanov <xemul@openvz.org>
      Acked-by: default avatarSerge Hallyn <serue@us.ibm.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ae5e1b22
  12. 06 Feb, 2008 1 commit
    • Pierre Peiffer's avatar
      IPC: fix error check in all new xxx_lock() and xxx_exit_ns() functions · b1ed88b4
      Pierre Peiffer authored
      In the new implementation of the [sem|shm|msg]_lock[_check]() routines, we
      use the return value of ipc_lock() in container_of() without any check.
      But ipc_lock may return a errcode.  The use of this errcode in
      container_of() may alter this errcode, and we don't want this.
      
      And in xxx_exit_ns, the pointer return by idr_find is of type 'struct
      kern_ipc_per'...
      
      Today, the code will work as is because the member used in these
      container_of() is the first member of its container (offset == 0), the
      errcode isn't changed then.  But in the general case, we can't count on
      this assumption and this may lead later to a real bug if we don't correct
      this.
      
      Again, the proposed solution is simple and correct.  But, as pointed by
      Nadia, with this solution, the same check will be done several times (in
      all sub-callers...), what is not very funny/optimal...
      Signed-off-by: default avatarPierre Peiffer <pierre.peiffer@bull.net>
      Cc: Nadia Derbey <Nadia.Derbey@bull.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b1ed88b4
  13. 19 Oct, 2007 10 commits
  14. 17 Oct, 2007 2 commits
    • Dave Hansen's avatar
      r/o bind mounts: filesystem helpers for custom 'struct file's · ce8d2cdf
      Dave Hansen authored
      Why do we need r/o bind mounts?
      
      This feature allows a read-only view into a read-write filesystem.  In the
      process of doing that, it also provides infrastructure for keeping track of
      the number of writers to any given mount.
      
      This has a number of uses.  It allows chroots to have parts of filesystems
      writable.  It will be useful for containers in the future because users may
      have root inside a container, but should not be allowed to write to
      somefilesystems.  This also replaces patches that vserver has had out of the
      tree for several years.
      
      It allows security enhancement by making sure that parts of your filesystem
      read-only (such as when you don't trust your FTP server), when you don't want
      to have entire new filesystems mounted, or when you want atime selectively
      updated.  I've been using the following script to test that the feature is
      working as desired.  It takes a directory and makes a regular bind and a r/o
      bind mount of it.  It then performs some normal filesystem operations on the
      three directories, including ones that are expected to fail, like creating a
      file on the r/o mount.
      
      This patch:
      
      Some filesystems forego the vfs and may_open() and create their own 'struct
      file's.
      
      This patch creates a couple of helper functions which can be used by these
      filesystems, and will provide a unified place which the r/o bind mount code
      may patch.
      
      Also, rename an existing, static-scope init_file() to a less generic name.
      Signed-off-by: default avatarDave Hansen <haveblue@us.ibm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ce8d2cdf
    • Adrian Bunk's avatar
      ipc/shm.c: make 2 functions static · d823e3e7
      Adrian Bunk authored
      This patch makes two needlessly global functions static.
      Signed-off-by: default avatarAdrian Bunk <bunk@stusta.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d823e3e7
  15. 31 Jul, 2007 2 commits
  16. 19 Jul, 2007 2 commits
    • Nick Piggin's avatar
      mm: fault feedback #1 · d0217ac0
      Nick Piggin authored
      Change ->fault prototype.  We now return an int, which contains
      VM_FAULT_xxx code in the low byte, and FAULT_RET_xxx code in the next byte.
       FAULT_RET_ code tells the VM whether a page was found, whether it has been
      locked, and potentially other things.  This is not quite the way he wanted
      it yet, but that's changed in the next patch (which requires changes to
      arch code).
      
      This means we no longer set VM_CAN_INVALIDATE in the vma in order to say
      that a page is locked which requires filemap_nopage to go away (because we
      can no longer remain backward compatible without that flag), but we were
      going to do that anyway.
      
      struct fault_data is renamed to struct vm_fault as Linus asked. address
      is now a void __user * that we should firmly encourage drivers not to use
      without really good reason.
      
      The page is now returned via a page pointer in the vm_fault struct.
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d0217ac0
    • Nick Piggin's avatar
      mm: merge populate and nopage into fault (fixes nonlinear) · 54cb8821
      Nick Piggin authored
      Nonlinear mappings are (AFAIKS) simply a virtual memory concept that encodes
      the virtual address -> file offset differently from linear mappings.
      
      ->populate is a layering violation because the filesystem/pagecache code
      should need to know anything about the virtual memory mapping.  The hitch here
      is that the ->nopage handler didn't pass down enough information (ie.  pgoff).
       But it is more logical to pass pgoff rather than have the ->nopage function
      calculate it itself anyway (because that's a similar layering violation).
      
      Having the populate handler install the pte itself is likewise a nasty thing
      to be doing.
      
      This patch introduces a new fault handler that replaces ->nopage and
      ->populate and (later) ->nopfn.  Most of the old mechanism is still in place
      so there is a lot of duplication and nice cleanups that can be removed if
      everyone switches over.
      
      The rationale for doing this in the first place is that nonlinear mappings are
      subject to the pagefault vs invalidate/truncate race too, and it seemed stupid
      to duplicate the synchronisation logic rather than just consolidate the two.
      
      After this patch, MAP_NONBLOCK no longer sets up ptes for pages present in
      pagecache.  Seems like a fringe functionality anyway.
      
      NOPAGE_REFAULT is removed.  This should be implemented with ->fault, and no
      users have hit mainline yet.
      
      [akpm@linux-foundation.org: cleanup]
      [randy.dunlap@oracle.com: doc. fixes for readahead]
      [akpm@linux-foundation.org: build fix]
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarRandy Dunlap <randy.dunlap@oracle.com>
      Cc: Mark Fasheh <mark.fasheh@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      54cb8821
  17. 16 Jul, 2007 1 commit
  18. 16 Jun, 2007 2 commits
    • Eric W. Biederman's avatar
      shm: fix the filename of hugetlb sysv shared memory · 9d66586f
      Eric W. Biederman authored
      Some user space tools need to identify SYSV shared memory when examining
      /proc/<pid>/maps.  To do so they look for a block device with major zero, a
      dentry named SYSV<sysv key>, and having the minor of the internal sysv
      shared memory kernel mount.
      
      To help these tools and to make it easier for people just browsing
      /proc/<pid>/maps this patch modifies hugetlb sysv shared memory to use the
      SYSV<key> dentry naming convention.
      
      User space tools will still have to be aware that hugetlb sysv shared
      memory lives on a different internal kernel mount and so has a different
      block device minor number from the rest of sysv shared memory.
      Signed-off-by: default avatarEric W. Biederman <ebiederm@xmission.com>
      Cc: "Serge E. Hallyn" <serge@hallyn.com>
      Cc: Albert Cahalan <acahalan@gmail.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9d66586f
    • Adam Litke's avatar
      hugetlb: fix get_policy for stacked shared memory files · 22741925
      Adam Litke authored
      Here's another breakage as a result of shared memory stacked files :(
      
      The NUMA policy for a VMA is determined by checking the following (in the
      order given):
      
      1) vma->vm_ops->get_policy() (if defined)
      2) vma->vm_policy (if defined)
      3) task->mempolicy (if defined)
      4) Fall back to default_policy
      
      By switching to stacked files for shared memory, get_policy() is now always
      set to shm_get_policy which is a wrapper function.  This causes us to stop
      at step 1, which yields NULL for hugetlb instead of task->mempolicy which
      was the previous (and correct) result.
      
      This patch modifies the shm_get_policy() wrapper to maintain steps 1-3 for
      the wrapped vm_ops.
      
      (akpm: the refcounting of mempolicies is busted and this patch does nothing to
      improve it)
      Signed-off-by: default avatarAdam Litke <agl@us.ibm.com>
      Acked-by: default avatarWilliam Irwin <bill.irwin@oracle.com>
      Cc: dean gaudet <dean@arctic.org>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: <stable@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      22741925