1. 29 Apr, 2008 2 commits
  2. 28 Apr, 2008 2 commits
    • Lee Schermerhorn's avatar
      mempolicy: rework mempolicy Reference Counting [yet again] · 52cd3b07
      Lee Schermerhorn authored
      After further discussion with Christoph Lameter, it has become clear that my
      earlier attempts to clean up the mempolicy reference counting were a bit of
      overkill in some areas, resulting in superflous ref/unref in what are usually
      fast paths.  In other areas, further inspection reveals that I botched the
      unref for interleave policies.
      A separate patch, suitable for upstream/stable trees, fixes up the known
      errors in the previous attempt to fix reference counting.
      This patch reworks the memory policy referencing counting and, one hopes,
      simplifies the code.  Maybe I'll get it right this time.
      See the update to the numa_memory_policy.txt document for a discussion of
      memory policy reference counting that motivates this patch.
      Lookup of mempolicy, based on (vma, address) need only add a reference for
      shared policy, and we need only unref the policy when finished for shared
      policies.  So, this patch backs out all of the unneeded extra reference
      counting added by my previous attempt.  It then unrefs only shared policies
      when we're finished with them, using the mpol_cond_put() [conditional put]
      helper function introduced by this patch.
      Note that shmem_swapin() calls read_swap_cache_async() with a dummy vma
      containing just the policy.  read_swap_cache_async() can call alloc_page_vma()
      multiple times, so we can't let alloc_page_vma() unref the shared policy in
      this case.  To avoid this, we make a copy of any non-null shared policy and
      remove the MPOL_F_SHARED flag from the copy.  This copy occurs before reading
      a page [or multiple pages] from swap, so the overhead should not be an issue
      I introduced a new static inline function "mpol_cond_copy()" to copy the
      shared policy to an on-stack policy and remove the flags that would require a
      conditional free.  The current implementation of mpol_cond_copy() assumes that
      the struct mempolicy contains no pointers to dynamically allocated structures
      that must be duplicated or reference counted during copy.
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Lee Schermerhorn's avatar
      mempolicy: fixup Fallback for Default Shmem Policy · ae4d8c16
      Lee Schermerhorn authored
      get_vma_policy() is not handling fallback to task policy correctly when the
      get_policy() vm_op returns NULL.  The NULL overwrites the 'pol' variable that
      was holding the fallback task mempolicy.  So, it was falling back directly to
      system default policy.
      Fix get_vma_policy() to use only non-NULL policy returned from the vma
      get_policy op.
      shm_get_policy() was falling back to current task's mempolicy if the "backing
      file system" [tmpfs vs hugetlbfs] does not support the get_policy vm_op and
      the vma policy is null.  This is incorrect for show_numa_maps() which is
      likely querying the numa_maps of some task other than current.  Remove this
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  3. 11 Mar, 2008 1 commit
    • Lee Schermerhorn's avatar
      mempolicy: fix reference counting bugs · 69682d85
      Lee Schermerhorn authored
      Address 3 known bugs in the current memory policy reference counting method.
      I have a series of patches to rework the reference counting to reduce overhead
      in the allocation path.  However, that series will require testing in -mm once
      I repost it.
      1) alloc_page_vma() does not release the extra reference taken for
         vma/shared mempolicy when the mode == MPOL_INTERLEAVE.  This can result in
         leaking mempolicy structures.  This is probably occurring, but not being
         Fix:  add the conditional release of the reference.
      2) hugezonelist unconditionally releases a reference on the mempolicy when
         mode == MPOL_INTERLEAVE.  This can result in decrementing the reference
         count for system default policy [should have no ill effect] or premature
         freeing of task policy.  If this occurred, the next allocation using task
         mempolicy would use the freed structure and probably BUG out.
         Fix:  add the necessary check to the release.
      3) The current reference counting method assumes that vma 'get_policy()'
         methods automatically add an extra reference a non-NULL returned mempolicy.
          This is true for shmem_get_policy() used by tmpfs mappings, including
         regular page shm segments.  However, SHM_HUGETLB shm's, backed by
         hugetlbfs, just use the vma policy without the extra reference.  This
         results in freeing of the vma policy on the first allocation, with reuse of
         the freed mempolicy structure on subsequent allocations.
         Fix: Rather than add another condition to the conditional reference
         release, which occur in the allocation path, just add a reference when
         returning the vma policy in shm_get_policy() to match the assumptions.
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <eric.whitney@hp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  4. 08 Feb, 2008 3 commits
    • Pierre Peiffer's avatar
      IPC: consolidate sem_exit_ns(), msg_exit_ns() and shm_exit_ns() · 01b8b07a
      Pierre Peiffer authored
      sem_exit_ns(), msg_exit_ns() and shm_exit_ns() are all called when an
      ipc_namespace is released to free all ipcs of each type.  But in fact, they
      do the same thing: they loop around all ipcs to free them individually by
      calling a specific routine.
      This patch proposes to consolidate this by introducing a common function,
      free_ipcs(), that do the job.  The specific routine to call on each
      individual ipcs is passed as parameter.  For this, these ipc-specific
      'free' routines are reworked to take a generic 'struct ipc_perm' as
      Signed-off-by: default avatarPierre Peiffer <pierre.peiffer@bull.net>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Nadia Derbey <Nadia.Derbey@bull.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Pierre Peiffer's avatar
      IPC: make struct ipc_ids static in ipc_namespace · ed2ddbf8
      Pierre Peiffer authored
      Each ipc_namespace contains a table of 3 pointers to struct ipc_ids (3 for
      msg, sem and shm, structure used to store all ipcs) These 'struct ipc_ids'
      are dynamically allocated for each icp_namespace as the ipc_namespace
      itself (for the init namespace, they are initialized with pointers to
      static variables instead)
      It is so for historical reason: in fact, before the use of idr to store the
      ipcs, the ipcs were stored in tables of variable length, depending of the
      maximum number of ipc allowed.  Now, these 'struct ipc_ids' have a fixed
      size.  As they are allocated in any cases for each new ipc_namespace, there
      is no gain of memory in having them allocated separately of the struct
      This patch proposes to make this table static in the struct ipc_namespace.
      Thus, we can allocate all in once and get rid of all the code needed to
      allocate and free these ipc_ids separately.
      Signed-off-by: default avatarPierre Peiffer <pierre.peiffer@bull.net>
      Acked-by: default avatarCedric Le Goater <clg@fr.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Nadia Derbey <Nadia.Derbey@bull.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Pavel Emelyanov's avatar
      namespaces: move the IPC namespace under IPC_NS option · ae5e1b22
      Pavel Emelyanov authored
      Currently the IPC namespace management code is spread over the ipc/*.c files.
      I moved this code into ipc/namespace.c file which is compiled out when needed.
      The linux/ipc_namespace.h file is used to store the prototypes of the
      functions in namespace.c and the stubs for NAMESPACES=n case.  This is done
      so, because the stub for copy_ipc_namespace requires the knowledge of the
      CLONE_NEWIPC flag, which is in sched.h.  But the linux/ipc.h file itself in
      included into many many .c files via the sys.h->sem.h sequence so adding the
      sched.h into it will make all these .c depend on sched.h which is not that
      good.  On the other hand the knowledge about the namespaces stuff is required
      in 4 .c files only.
      Besides, this patch compiles out some auxiliary functions from ipc/sem.c,
      msg.c and shm.c files.  It turned out that moving these functions into
      namespaces.c is not that easy because they use many other calls and macros
      from the original file.  Moving them would make this patch complicated.  On
      the other hand all these functions can be consolidated, so I will send a
      separate patch doing this a bit later.
      Signed-off-by: default avatarPavel Emelyanov <xemul@openvz.org>
      Acked-by: default avatarSerge Hallyn <serue@us.ibm.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  5. 06 Feb, 2008 1 commit
    • Pierre Peiffer's avatar
      IPC: fix error check in all new xxx_lock() and xxx_exit_ns() functions · b1ed88b4
      Pierre Peiffer authored
      In the new implementation of the [sem|shm|msg]_lock[_check]() routines, we
      use the return value of ipc_lock() in container_of() without any check.
      But ipc_lock may return a errcode.  The use of this errcode in
      container_of() may alter this errcode, and we don't want this.
      And in xxx_exit_ns, the pointer return by idr_find is of type 'struct
      Today, the code will work as is because the member used in these
      container_of() is the first member of its container (offset == 0), the
      errcode isn't changed then.  But in the general case, we can't count on
      this assumption and this may lead later to a real bug if we don't correct
      Again, the proposed solution is simple and correct.  But, as pointed by
      Nadia, with this solution, the same check will be done several times (in
      all sub-callers...), what is not very funny/optimal...
      Signed-off-by: default avatarPierre Peiffer <pierre.peiffer@bull.net>
      Cc: Nadia Derbey <Nadia.Derbey@bull.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  6. 19 Oct, 2007 10 commits
  7. 17 Oct, 2007 2 commits
    • Dave Hansen's avatar
      r/o bind mounts: filesystem helpers for custom 'struct file's · ce8d2cdf
      Dave Hansen authored
      Why do we need r/o bind mounts?
      This feature allows a read-only view into a read-write filesystem.  In the
      process of doing that, it also provides infrastructure for keeping track of
      the number of writers to any given mount.
      This has a number of uses.  It allows chroots to have parts of filesystems
      writable.  It will be useful for containers in the future because users may
      have root inside a container, but should not be allowed to write to
      somefilesystems.  This also replaces patches that vserver has had out of the
      tree for several years.
      It allows security enhancement by making sure that parts of your filesystem
      read-only (such as when you don't trust your FTP server), when you don't want
      to have entire new filesystems mounted, or when you want atime selectively
      updated.  I've been using the following script to test that the feature is
      working as desired.  It takes a directory and makes a regular bind and a r/o
      bind mount of it.  It then performs some normal filesystem operations on the
      three directories, including ones that are expected to fail, like creating a
      file on the r/o mount.
      This patch:
      Some filesystems forego the vfs and may_open() and create their own 'struct
      This patch creates a couple of helper functions which can be used by these
      filesystems, and will provide a unified place which the r/o bind mount code
      may patch.
      Also, rename an existing, static-scope init_file() to a less generic name.
      Signed-off-by: default avatarDave Hansen <haveblue@us.ibm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Adrian Bunk's avatar
      ipc/shm.c: make 2 functions static · d823e3e7
      Adrian Bunk authored
      This patch makes two needlessly global functions static.
      Signed-off-by: default avatarAdrian Bunk <bunk@stusta.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  8. 31 Jul, 2007 2 commits
  9. 19 Jul, 2007 2 commits
    • Nick Piggin's avatar
      mm: fault feedback #1 · d0217ac0
      Nick Piggin authored
      Change ->fault prototype.  We now return an int, which contains
      VM_FAULT_xxx code in the low byte, and FAULT_RET_xxx code in the next byte.
       FAULT_RET_ code tells the VM whether a page was found, whether it has been
      locked, and potentially other things.  This is not quite the way he wanted
      it yet, but that's changed in the next patch (which requires changes to
      arch code).
      This means we no longer set VM_CAN_INVALIDATE in the vma in order to say
      that a page is locked which requires filemap_nopage to go away (because we
      can no longer remain backward compatible without that flag), but we were
      going to do that anyway.
      struct fault_data is renamed to struct vm_fault as Linus asked. address
      is now a void __user * that we should firmly encourage drivers not to use
      without really good reason.
      The page is now returned via a page pointer in the vm_fault struct.
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Nick Piggin's avatar
      mm: merge populate and nopage into fault (fixes nonlinear) · 54cb8821
      Nick Piggin authored
      Nonlinear mappings are (AFAIKS) simply a virtual memory concept that encodes
      the virtual address -> file offset differently from linear mappings.
      ->populate is a layering violation because the filesystem/pagecache code
      should need to know anything about the virtual memory mapping.  The hitch here
      is that the ->nopage handler didn't pass down enough information (ie.  pgoff).
       But it is more logical to pass pgoff rather than have the ->nopage function
      calculate it itself anyway (because that's a similar layering violation).
      Having the populate handler install the pte itself is likewise a nasty thing
      to be doing.
      This patch introduces a new fault handler that replaces ->nopage and
      ->populate and (later) ->nopfn.  Most of the old mechanism is still in place
      so there is a lot of duplication and nice cleanups that can be removed if
      everyone switches over.
      The rationale for doing this in the first place is that nonlinear mappings are
      subject to the pagefault vs invalidate/truncate race too, and it seemed stupid
      to duplicate the synchronisation logic rather than just consolidate the two.
      After this patch, MAP_NONBLOCK no longer sets up ptes for pages present in
      pagecache.  Seems like a fringe functionality anyway.
      NOPAGE_REFAULT is removed.  This should be implemented with ->fault, and no
      users have hit mainline yet.
      [akpm@linux-foundation.org: cleanup]
      [randy.dunlap@oracle.com: doc. fixes for readahead]
      [akpm@linux-foundation.org: build fix]
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarRandy Dunlap <randy.dunlap@oracle.com>
      Cc: Mark Fasheh <mark.fasheh@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  10. 16 Jul, 2007 1 commit
  11. 16 Jun, 2007 3 commits
  12. 02 Mar, 2007 1 commit
    • Adam Litke's avatar
      [PATCH] Fix get_unmapped_area and fsync for hugetlb shm segments · 516dffdc
      Adam Litke authored
      This patch provides the following hugetlb-related fixes to the recent stacked
      shm files changes:
       - Update is_file_hugepages() so it will reconize hugetlb shm segments.
       - get_unmapped_area must be called with the nested file struct to handle
         the sfd->file->f_ops->get_unmapped_area == NULL case.
       - The fsync f_op must be wrapped since it is specified in the hugetlbfs
      This is based on proposed fixes from Eric Biederman that were debugged and
      tested by me.  Without it, attempting to use hugetlb shared memory segments
      on powerpc (and likely ia64) will kill your box.
      Signed-off-by: default avatarAdam Litke <agl@us.ibm.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarWilliam Irwin <bill.irwin@oracle.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  13. 01 Mar, 2007 1 commit
  14. 21 Feb, 2007 1 commit
    • Eric W. Biederman's avatar
      [PATCH] shm: make sysv ipc shared memory use stacked files · bc56bba8
      Eric W. Biederman authored
      The current ipc shared memory code runs into several problems because it
      does not quite use files like the rest of the kernel.  With the option of
      backing ipc shared memory with either hugetlbfs or ordinary shared memory
      the problems got worse.  With the added support for ipc namespaces things
      behaved so unexpected that we now have several bad namespace reference
      counting bugs when using what appears at first glance to be a reasonable
      So to attack these problems and hopefully make the code more maintainable
      this patch simply uses the files provided by other parts of the kernel and
      builds it's own files out of them.  The shm files are allocated in do_shmat
      and freed when their reference count drops to zero with their last unmap.
      The file and vm operations that we don't want to implement or we don't
      implement completely we just delegate to the operations of our backing
      This means that we now get an accurate shm_nattch count for we have a
      hugetlbfs inode for backing store, and the shm accounting of last attach
      and last detach time work as well.
      This means that getting a reference to the ipc namespace when we create the
      file and dropping the referenece in the release method is now safe and
      This means we no longer need a special case for clearing VM_MAYWRITE
      as our file descriptor now only has write permissions when we have
      requested write access when calling shmat.  Although VM_SHARED is now
      cleared as well which I believe is harmless and is mostly likely a
      minor bug fix.
      By using the same set of operations for both the hugetlb case and regular
      shared memory case shmdt is not simplified and made slightly more correct
      as now the test "vma->vm_ops == &shm_vm_ops" is 100% accurate in spotting
      all shared memory regions generated from sysvipc shared memory.
      Signed-off-by: default avatarEric W. Biederman <ebiederm@xmission.com>
      Cc: Michal Piotrowski <michal.k.k.piotrowski@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  15. 12 Feb, 2007 1 commit
  16. 23 Jan, 2007 1 commit
  17. 08 Dec, 2006 1 commit
  18. 03 Nov, 2006 1 commit
    • Pavel Emelianov's avatar
      [PATCH] Fix ipc entries removal · c7e12b83
      Pavel Emelianov authored
      Fix two issuses related to ipc_ids->entries freeing.
      1. When freeing ipc namespace we need to free entries allocated
         with ipc_init_ids().
      2. When removing old entries in grow_ary() ipc_rcu_putref()
         may be called on entries set to &ids->nullentry earlier in
         This is almost impossible without namespaces, but with
         them this situation becomes possible.
      Found during OpenVZ testing after obvious leaks in beancounters.
      Signed-off-by: default avatarPavel Emelianov <xemul@openvz.org>
      Cc: Kirill Korotaev <dev@openvz.org>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
  19. 02 Oct, 2006 1 commit
  20. 30 Jun, 2006 1 commit
  21. 23 Jun, 2006 1 commit
  22. 20 Jun, 2006 1 commit
    • Linda Knippers's avatar
      [PATCH] update of IPC audit record cleanup · ac03221a
      Linda Knippers authored
      The following patch addresses most of the issues with the IPC_SET_PERM
      records as described in:
      and addresses the comments I received on the record field names.
      To summarize, I made the following changes:
      1. Changed sys_msgctl() and semctl_down() so that an IPC_SET_PERM
         record is emitted in the failure case as well as the success case.
         This matches the behavior in sys_shmctl().  I could simplify the
         code in sys_msgctl() and semctl_down() slightly but it would mean
         that in some error cases we could get an IPC_SET_PERM record
         without an IPC record and that seemed odd.
      2. No change to the IPC record type, given no feedback on the backward
         compatibility question.
      3. Removed the qbytes field from the IPC record.  It wasn't being
         set and when audit_ipc_obj() is called from ipcperms(), the
         information isn't available.  If we want the information in the IPC
         record, more extensive changes will be necessary.  Since it only
         applies to message queues and it isn't really permission related, it
         doesn't seem worth it.
      4. Removed the obj field from the IPC_SET_PERM record.  This means that
         the kern_ipc_perm argument is no longer needed.
      5. Removed the spaces and renamed the IPC_SET_PERM field names.  Replaced iuid and
         igid fields with ouid and ogid in the IPC record.
      I tested this with the lspp.22 kernel on an x86_64 box.  I believe it
      applies cleanly on the latest kernel.
      -- ljk
      Signed-off-by: default avatarLinda Knippers <linda.knippers@hp.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>