1. 24 Jul, 2008 6 commits
    • Andi Kleen's avatar
      hugetlbfs: per mount huge page sizes · a137e1cc
      Andi Kleen authored
      
      
      Add the ability to configure the hugetlb hstate used on a per mount basis.
      
      - Add a new pagesize= option to the hugetlbfs mount that allows setting
        the page size
      - This option causes the mount code to find the hstate corresponding to the
        specified size, and sets up a pointer to the hstate in the mount's
        superblock.
      - Change the hstate accessors to use this information rather than the
        global_hstate they were using (requires a slight change in mm/memory.c
        so we don't NULL deref in the error-unmap path -- see comments).
      
      [np: take hstate out of hugetlbfs inode and vma->vm_private_data]
      Acked-by: default avatarAdam Litke <agl@us.ibm.com>
      Acked-by: default avatarNishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: default avatarAndi Kleen <ak@suse.de>
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a137e1cc
    • Andi Kleen's avatar
      hugetlb: modular state for hugetlb page size · a5516438
      Andi Kleen authored
      
      
      The goal of this patchset is to support multiple hugetlb page sizes.  This
      is achieved by introducing a new struct hstate structure, which
      encapsulates the important hugetlb state and constants (eg.  huge page
      size, number of huge pages currently allocated, etc).
      
      The hstate structure is then passed around the code which requires these
      fields, they will do the right thing regardless of the exact hstate they
      are operating on.
      
      This patch adds the hstate structure, with a single global instance of it
      (default_hstate), and does the basic work of converting hugetlb to use the
      hstate.
      
      Future patches will add more hstate structures to allow for different
      hugetlbfs mounts to have different page sizes.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Acked-by: default avatarAdam Litke <agl@us.ibm.com>
      Acked-by: default avatarNishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: default avatarAndi Kleen <ak@suse.de>
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a5516438
    • Mel Gorman's avatar
      hugetlb: guarantee that COW faults for a process that called mmap(MAP_PRIVATE)... · 04f2cbe3
      Mel Gorman authored
      
      hugetlb: guarantee that COW faults for a process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed
      
      After patch 2 in this series, a process that successfully calls mmap() for
      a MAP_PRIVATE mapping will be guaranteed to successfully fault until a
      process calls fork().  At that point, the next write fault from the parent
      could fail due to COW if the child still has a reference.
      
      We only reserve pages for the parent but a copy must be made to avoid
      leaking data from the parent to the child after fork().  Reserves could be
      taken for both parent and child at fork time to guarantee faults but if
      the mapping is large it is highly likely we will not have sufficient pages
      for the reservation, and it is common to fork only to exec() immediatly
      after.  A failure here would be very undesirable.
      
      Note that the current behaviour of mainline with MAP_PRIVATE pages is
      pretty bad.  The following situation is allowed to occur today.
      
      1. Process calls mmap(MAP_PRIVATE)
      2. Process calls mlock() to fault all pages and makes sure it succeeds
      3. Process forks()
      4. Process writes to MAP_PRIVATE mapping while child still exists
      5. If the COW fails at this point, the process gets SIGKILLed even though it
         had taken care to ensure the pages existed
      
      This patch improves the situation by guaranteeing the reliability of the
      process that successfully calls mmap().  When the parent performs COW, it
      will try to satisfy the allocation without using reserves.  If that fails
      the parent will steal the page leaving any children without a page.
      Faults from the child after that point will result in failure.  If the
      child COW happens first, an attempt will be made to allocate the page
      without reserves and the child will get SIGKILLed on failure.
      
      To summarise the new behaviour:
      
      1. If the original mapper performs COW on a private mapping with multiple
         references, it will attempt to allocate a hugepage from the pool or
         the buddy allocator without using the existing reserves. On fail, VMAs
         mapping the same area are traversed and the page being COW'd is unmapped
         where found. It will then steal the original page as the last mapper in
         the normal way.
      
      2. The VMAs the pages were unmapped from are flagged to note that pages
         with data no longer exist. Future no-page faults on those VMAs will
         terminate the process as otherwise it would appear that data was corrupted.
         A warning is printed to the console that this situation occured.
      
      2. If the child performs COW first, it will attempt to satisfy the COW
         from the pool if there are enough pages or via the buddy allocator if
         overcommit is allowed and the buddy allocator can satisfy the request. If
         it fails, the child will be killed.
      
      If the pool is large enough, existing applications will not notice that
      the reserves were a factor.  Existing applications depending on the
      no-reserves been set are unlikely to exist as for much of the history of
      hugetlbfs, pages were prefaulted at mmap(), allocating the pages at that
      point or failing the mmap().
      
      [npiggin@suse.de: fix CONFIG_HUGETLB=n build]
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Acked-by: default avatarAdam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      04f2cbe3
    • Jan Beulich's avatar
      mm: remove double indirection on tlb parameter to free_pgd_range() & Co · 42b77728
      Jan Beulich authored
      
      
      The double indirection here is not needed anywhere and hence (at least)
      confusing.
      Signed-off-by: default avatarJan Beulich <jbeulich@novell.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Acked-by: default avatarJeremy Fitzhardinge <jeremy@goop.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      42b77728
    • Rik van Riel's avatar
      access_process_vm device memory infrastructure · 28b2ee20
      Rik van Riel authored
      
      
      In order to be able to debug things like the X server and programs using
      the PPC Cell SPUs, the debugger needs to be able to access device memory
      through ptrace and /proc/pid/mem.
      
      This patch:
      
      Add the generic_access_phys access function and put the hooks in place
      to allow access_process_vm to access device or PPC Cell SPU memory.
      
      [riel@redhat.com: Add documentation for the vm_ops->access function]
      Signed-off-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarBenjamin Herrensmidt <benh@kernel.crashing.org>
      Cc: Dave Airlie <airlied@linux.ie>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Acked-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      28b2ee20
    • Nick Piggin's avatar
      mm: remove nopfn · 0d71d10a
      Nick Piggin authored
      
      
      There are no users of nopfn in the tree. Remove it.
      
      [hugh@veritas.com: fix build error]
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0d71d10a
  2. 04 Jul, 2008 2 commits
    • Oleg Nesterov's avatar
      get_user_pages(): fix possible page leak on oom · 7a36a752
      Oleg Nesterov authored
      
      
      get_user_pages() must not return the error when i != 0.  When pages !=
      NULL we have i get_page()'ed pages.
      Signed-off-by: default avatarOleg Nesterov <oleg@tv-sign.ru>
      Acked-by: default avatarNick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7a36a752
    • Peter Zijlstra's avatar
      mm: dirty page accounting vs VM_MIXEDMAP · 251b97f5
      Peter Zijlstra authored
      
      
      Dirty page accounting accurately measures the amound of dirty pages in
      writable shared mappings by mapping the pages RO (as indicated by
      vma_wants_writenotify).  We then trap on first write and call
      set_page_dirty() on the page, after which we map the page RW and
      continue execution.
      
      When we launder dirty pages, we call clear_page_dirty_for_io() which
      clears both the dirty flag, and maps the page RO again before we start
      writeout so that the story can repeat itself.
      
      vma_wants_writenotify() excludes VM_PFNMAP on the basis that we cannot
      do the regular dirty page stuff on raw PFNs and the memory isn't going
      anywhere anyway.
      
      The recently introduced VM_MIXEDMAP mixes both !pfn_valid() and
      pfn_valid() pages in a single mapping.
      
      We can't do dirty page accounting on !pfn_valid() pages as stated
      above, and mapping them RO causes them to be COW'ed on write, which
      breaks VM_SHARED semantics.
      
      Excluding VM_MIXEDMAP in vma_wants_writenotify() would mean we don't do
      the regular dirty page accounting for the pfn_valid() pages, which
      would bring back all the head-aches from inaccurate dirty page
      accounting.
      
      So instead, we let the !pfn_valid() pages get mapped RO, but fix them
      up unconditionally in the fault path.
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Acked-by: default avatarHugh Dickins <hugh@veritas.com>
      Cc: "Jared Hulbert" <jaredeh@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      251b97f5
  3. 23 Jun, 2008 2 commits
    • Nick Piggin's avatar
      mm: fix race in COW logic · 945754a1
      Nick Piggin authored
      
      
      There is a race in the COW logic.  It contains a shortcut to avoid the
      COW and reuse the page if we have the sole reference on the page,
      however it is possible to have two racing do_wp_page()ers with one
      causing the other to mistakenly believe it is safe to take the shortcut
      when it is not.  This could lead to data corruption.
      
      Process 1 and process2 each have a wp pte of the same anon page (ie.
      one forked the other).  The page's mapcount is 2.  Then they both
      attempt to write to it around the same time...
      
        proc1				proc2 thr1			proc2 thr2
        CPU0				CPU1				CPU3
        do_wp_page()			do_wp_page()
      				 trylock_page()
      				  can_share_swap_page()
      				   load page mapcount (==2)
      				  reuse = 0
      				 pte unlock
      				 copy page to new_page
      				 pte lock
      				 page_remove_rmap(page);
         trylock_page()
          can_share_swap_page()
           load page mapcount (==1)
          reuse = 1
         ptep_set_access_flags (allow W)
      
        write private key into page
      								read from page
      				ptep_clear_flush()
      				set_pte_at(pte of new_page)
      
      Fix this by moving the page_remove_rmap of the old page after the pte
      clear and flush.  Potentially the entire branch could be moved down
      here, but in order to stay consistent, I won't (should probably move all
      the *_mm_counter stuff with one patch).
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Acked-by: default avatarHugh Dickins <hugh@veritas.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      945754a1
    • Linus Torvalds's avatar
      Fix ZERO_PAGE breakage with vmware · 672ca28e
      Linus Torvalds authored
      Commit 89f5b7da
      
       ("Reinstate ZERO_PAGE
      optimization in 'get_user_pages()' and fix XIP") broke vmware, as
      reported by Jeff Chua:
      
        "This broke vmware 6.0.4.
         Jun 22 14:53:03.845: vmx| NOT_IMPLEMENTED
         /build/mts/release/bora-93057/bora/vmx/main/vmmonPosix.c:774"
      
      and the reason seems to be that there's an old bug in how we handle do
      FOLL_ANON on VM_SHARED areas in get_user_pages(), but since it only
      triggered if the whole page table was missing, nobody had apparently hit
      it before.
      
      The recent changes to 'follow_page()' made the FOLL_ANON logic trigger
      not just for whole missing page tables, but for individual pages as
      well, and exposed this problem.
      
      This fixes it by making the test for when FOLL_ANON is used more
      careful, and also makes the code easier to read and understand by moving
      the logic to a separate inline function.
      Reported-and-tested-by: default avatarJeff Chua <jeff.chua.linux@gmail.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      672ca28e
  4. 20 Jun, 2008 1 commit
    • Linus Torvalds's avatar
      Reinstate ZERO_PAGE optimization in 'get_user_pages()' and fix XIP · 89f5b7da
      Linus Torvalds authored
      KAMEZAWA Hiroyuki and Oleg Nesterov point out that since the commit
      557ed1fa
      
       ("remove ZERO_PAGE") removed
      the ZERO_PAGE from the VM mappings, any users of get_user_pages() will
      generally now populate the VM with real empty pages needlessly.
      
      We used to get the ZERO_PAGE when we did the "handle_mm_fault()", but
      since fault handling no longer uses ZERO_PAGE for new anonymous pages,
      we now need to handle that special case in follow_page() instead.
      
      In particular, the removal of ZERO_PAGE effectively removed the core
      file writing optimization where we would skip writing pages that had not
      been populated at all, and increased memory pressure a lot by allocating
      all those useless newly zeroed pages.
      
      This reinstates the optimization by making the unmapped PTE case the
      same as for a non-existent page table, which already did this correctly.
      
      While at it, this also fixes the XIP case for follow_page(), where the
      caller could not differentiate between the case of a page that simply
      could not be used (because it had no "struct page" associated with it)
      and a page that just wasn't mapped.
      
      We do that by simply returning an error pointer for pages that could not
      be turned into a "struct page *".  The error is arbitrarily picked to be
      EFAULT, since that was what get_user_pages() already used for the
      equivalent IO-mapped page case.
      
      [ Also removed an impossible test for pte_offset_map_lock() failing:
        that's not how that function works ]
      Acked-by: default avatarOleg Nesterov <oleg@tv-sign.ru>
      Acked-by: default avatarNick Piggin <npiggin@suse.de>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Roland McGrath <roland@redhat.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      89f5b7da
  5. 24 May, 2008 1 commit
  6. 14 May, 2008 1 commit
    • Nick Piggin's avatar
      fix SMP data race in pagetable setup vs walking · 362a61ad
      Nick Piggin authored
      
      
      There is a possible data race in the page table walking code. After the split
      ptlock patches, it actually seems to have been introduced to the core code, but
      even before that I think it would have impacted some architectures (powerpc
      and sparc64, at least, walk the page tables without taking locks eg. see
      find_linux_pte()).
      
      The race is as follows:
      The pte page is allocated, zeroed, and its struct page gets its spinlock
      initialized. The mm-wide ptl is then taken, and then the pte page is inserted
      into the pagetables.
      
      At this point, the spinlock is not guaranteed to have ordered the previous
      stores to initialize the pte page with the subsequent store to put it in the
      page tables. So another Linux page table walker might be walking down (without
      any locks, because we have split-leaf-ptls), and find that new pte we've
      inserted. It might try to take the spinlock before the store from the other
      CPU initializes it. And subsequently it might read a pte_t out before stores
      from the other CPU have cleared the memory.
      
      There are also similar races in higher levels of the page tables. They
      obviously don't involve the spinlock, but could see uninitialized memory.
      
      Arch code and hardware pagetable walkers that walk the pagetables without
      locks could see similar uninitialized memory problems, regardless of whether
      split ptes are enabled or not.
      
      I prefer to put the barriers in core code, because that's where the higher
      level logic happens, but the page table accessors are per-arch, and open-coding
      them everywhere I don't think is an option. I'll put the read-side barriers
      in alpha arch code for now (other architectures perform data-dependent loads
      in order).
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      362a61ad
  7. 06 May, 2008 1 commit
    • Hugh Dickins's avatar
      x86: fix PAE pmd_bad bootup warning · aeed5fce
      Hugh Dickins authored
      Fix warning from pmd_bad() at bootup on a HIGHMEM64G HIGHPTE x86_32.
      
      That came from 9fc34113 x86: debug pmd_bad();
      but we understand now that the typecasting was wrong for PAE in the previous
      version: pagetable pages above 4GB looked bad and stopped Arjan from booting.
      
      And revert that cded932b
      
       x86: fix pmd_bad
      and pud_bad to support huge pages.  It was the wrong way round: we shouldn't
      weaken every pmd_bad and pud_bad check to let huge pages slip through - in
      part they check that we _don't_ have a huge page where it's not expected.
      
      Put the x86 pmd_bad() and pud_bad() definitions back to what they have long
      been: they can be improved (x86_32 should use PTE_MASK, to stop PAE thinking
      junk in the upper word is good; and x86_64 should follow x86_32's stricter
      comparison, to stop thinking any subset of required bits is good); but that
      should be a later patch.
      
      Fix Hans' good observation that follow_page() will never find pmd_huge()
      because that would have already failed the pmd_bad test: test pmd_huge in
      between the pmd_none and pmd_bad tests.  Tighten x86's pmd_huge() check?
      No, once it's a hugepage entry, it can get quite far from a good pmd: for
      example, PROT_NONE leaves it with only ACCESSED of the KERN_PGTABLE bits.
      
      However... though follow_page() contains this and another test for huge
      pages, so it's nice to keep it working on them, where does it actually get
      called on a huge page?  get_user_pages() checks is_vm_hugetlb_page(vma) to
      to call alternative hugetlb processing, as does unmap_vmas() and others.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Earlier-version-tested-by: default avatarIngo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jeff Chua <jeff.chua.linux@gmail.com>
      Cc: Hans Rosenfeld <hans.rosenfeld@amd.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      aeed5fce
  8. 28 Apr, 2008 4 commits
    • Nick Piggin's avatar
      mm: add vm_insert_mixed · 423bad60
      Nick Piggin authored
      
      
      vm_insert_mixed will insert either a raw pfn or a refcounted struct page into
      the page tables, depending on whether vm_normal_page() will return the page or
      not.  With the introduction of the new pte bit, this is now a too tricky for
      drivers to be doing themselves.
      
      filemap_xip uses this in a subsequent patch.
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Cc: Jared Hulbert <jaredeh@gmail.com>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      423bad60
    • Nick Piggin's avatar
      mm: introduce pte_special pte bit · 7e675137
      Nick Piggin authored
      
      
      s390 for one, cannot implement VM_MIXEDMAP with pfn_valid, due to their memory
      model (which is more dynamic than most).  Instead, they had proposed to
      implement it with an additional path through vm_normal_page(), using a bit in
      the pte to determine whether or not the page should be refcounted:
      
      vm_normal_page()
      {
      	...
              if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
                      if (vma->vm_flags & VM_MIXEDMAP) {
      #ifdef s390
      			if (!mixedmap_refcount_pte(pte))
      				return NULL;
      #else
                              if (!pfn_valid(pfn))
                                      return NULL;
      #endif
                              goto out;
                      }
      	...
      }
      
      This is fine, however if we are allowed to use a bit in the pte to determine
      refcountedness, we can use that to _completely_ replace all the vma based
      schemes.  So instead of adding more cases to the already complex vma-based
      scheme, we can have a clearly seperate and simple pte-based scheme (and get
      slightly better code generation in the process):
      
      vm_normal_page()
      {
      #ifdef s390
      	if (!mixedmap_refcount_pte(pte))
      		return NULL;
      	return pte_page(pte);
      #else
      	...
      #endif
      }
      
      And finally, we may rather make this concept usable by any architecture rather
      than making it s390 only, so implement a new type of pte state for this.
      Unfortunately the old vma based code must stay, because some architectures may
      not be able to spare pte bits.  This makes vm_normal_page a little bit more
      ugly than we would like, but the 2 cases are clearly seperate.
      
      So introduce a pte_special pte state, and use it in mm/memory.c.  It is
      currently a noop for all architectures, so this doesn't actually result in any
      compiled code changes to mm/memory.o.
      
      BTW:
      I haven't put vm_normal_page() into arch code as-per an earlier suggestion.
      The reason is that, regardless of where vm_normal_page is actually
      implemented, the *abstraction* is still exactly the same. Also, while it
      depends on whether the architecture has pte_special or not, that is the
      only two possible cases, and it really isn't an arch specific function --
      the role of the arch code should be to provide primitive functions and
      accessors with which to build the core code; pte_special does that. We do
      not want architectures to know or care about vm_normal_page itself, and
      we definitely don't want them being able to invent something new there
      out of sight of mm/ code. If we made vm_normal_page an arch function, then
      we have to make vm_insert_mixed (next patch) an arch function too. So I
      don't think moving it to arch code fundamentally improves any abstractions,
      while it does practically make the code more difficult to follow, for both
      mm and arch developers, and easier to misuse.
      
      [akpm@linux-foundation.org: build fix]
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Acked-by: default avatarCarsten Otte <cotte@de.ibm.com>
      Cc: Jared Hulbert <jaredeh@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7e675137
    • Jared Hulbert's avatar
      mm: introduce VM_MIXEDMAP · b379d790
      Jared Hulbert authored
      
      
      This series introduces some important infrastructure work.  The overall result
      is that:
      
      1. We now support XIP backed filesystems using memory that have no
         struct page allocated to them. And patches 6 and 7 actually implement
         this for s390.
      
         This is pretty important in a number of cases. As far as I understand,
         in the case of virtualisation (eg. s390), each guest may mount a
         readonly copy of the same filesystem (eg. the distro). Currently,
         guests need to allocate struct pages for this image. So if you have
         100 guests, you already need to allocate more memory for the struct
         pages than the size of the image. I think. (Carsten?)
      
         For other (eg. embedded) systems, you may have a very large non-
         volatile filesystem. If you have to have struct pages for this, then
         your RAM consumption will go up proportionally to fs size. Even
         though it is just a small proportion, the RAM can be much more costly
         eg in terms of power, so every KB less that Linux uses makes it more
         attractive to a lot of these guys.
      
      2. VM_MIXEDMAP allows us to support mappings where you actually do want
         to refcount _some_ pages in the mapping, but not others, and support
         COW on arbitrary (non-linear) mappings. Jared needs this for his NVRAM
         filesystem in progress. Future iterations of this filesystem will
         most likely want to migrate pages between pagecache and XIP backing,
         which is where the requirement for mixed (some refcounted, some not)
         comes from.
      
      3. pte_special also has a peripheral usage that I need for my lockless
         get_user_pages patch. That was shown to speed up "oltp" on db2 by
         10% on a 2 socket system, which is kind of significant because they
         scrounge for months to try to find 0.1% improvement on these
         workloads. I'm hoping we might finally be faster than AIX on
         pSeries with this :). My reference to lockless get_user_pages is not
         meant to justify this patchset (which doesn't include lockless gup),
         but just to show that pte_special is not some s390 specific thing that
         should be hidden in arch code or xip code: I definitely want to use it
         on at least x86 and powerpc as well.
      
      This patch:
      
      Introduce a new type of mapping, VM_MIXEDMAP.  This is unlike VM_PFNMAP in
      that it can support COW mappings of arbitrary ranges including ranges without
      struct page *and* ranges with a struct page that we actually want to refcount
      (PFNMAP can only support COW in those cases where the un-COW-ed translations
      are mapped linearly in the virtual address, and can only support non
      refcounted ranges).
      
      VM_MIXEDMAP achieves this by refcounting all pfn_valid pages, and not
      refcounting !pfn_valid pages (which is not an option for VM_PFNMAP, because it
      needs to avoid refcounting pfn_valid pages eg.  for /dev/mem mappings).
      Signed-off-by: default avatarJared Hulbert <jaredeh@gmail.com>
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Acked-by: default avatarCarsten Otte <cotte@de.ibm.com>
      Cc: Jared Hulbert <jaredeh@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b379d790
    • Nick Piggin's avatar
      mm: remove nopage · 3c18ddd1
      Nick Piggin authored
      
      
      Nothing in the tree uses nopage any more.  Remove support for it in the
      core mm code and documentation (and a few stray references to it in
      comments).
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3c18ddd1
  9. 05 Mar, 2008 2 commits
  10. 15 Feb, 2008 1 commit
  11. 14 Feb, 2008 1 commit
    • Ingo Molnar's avatar
      x86: fix "BUG: sleeping function called from invalid context" in print_vma_addr() · e8bff74a
      Ingo Molnar authored
      Jiri Kosina reported the following deadlock scenario with
      show_unhandled_signals enabled:
      
       [   68.379022] gnome-settings-[2941] trap int3 ip:3d2c840f34
       sp:7fff36f5d100 error:0<3>BUG: sleeping function called from invalid
       context at kernel/rwsem.c:21
       [   68.379039] in_atomic():1, irqs_disabled():0
       [   68.379044] no locks held by gnome-settings-/2941.
       [   68.379050] Pid: 2941, comm: gnome-settings- Not tainted 2.6.25-rc1 #30
       [   68.379054]
       [   68.379056] Call Trace:
       [   68.379061]  <#DB>  [<ffffffff81064883>] ? __debug_show_held_locks+0x13/0x30
       [   68.379109]  [<ffffffff81036765>] __might_sleep+0xe5/0x110
       [   68.379123]  [<ffffffff812f2240>] down_read+0x20/0x70
       [   68.379137]  [<ffffffff8109cdca>] print_vma_addr+0x3a/0x110
       [   68.379152]  [<ffffffff8100f435>] do_trap+0xf5/0x170
       [   68.379168]  [<ffffffff8100f52b>] do_int3+0x7b/0xe0
       [   68.379180]  [<ffffffff812f4a6f>] int3+0x9f/0xd0
       [   68.379203]  <<EOE>>
       [   68.379229]  in libglib-2.0.so.0.1505.0[3d2c800000+dc000]
      
      and tracked it down to:
      
        commit 03252919
      
      
        Author: Andi Kleen <ak@suse.de>
        Date:   Wed Jan 30 13:33:18 2008 +0100
      
            x86: print which shared library/executable faulted in segfault etc. messages
      
      the problem is that we call down_read() from an atomic context.
      
      Solve this by returning from print_vma_addr() if the preempt count is
      elevated. Update preempt_conditional_sti / preempt_conditional_cli to
      unconditionally lift the preempt count even on !CONFIG_PREEMPT.
      Reported-by: default avatarJiri Kosina <jkosina@suse.cz>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      e8bff74a
  12. 12 Feb, 2008 1 commit
    • Jonathan Corbet's avatar
      Be more robust about bad arguments in get_user_pages() · 900cf086
      Jonathan Corbet authored
      
      
      So I spent a while pounding my head against my monitor trying to figure
      out the vmsplice() vulnerability - how could a failure to check for
      *read* access turn into a root exploit? It turns out that it's a buffer
      overflow problem which is made easy by the way get_user_pages() is
      coded.
      
      In particular, "len" is a signed int, and it is only checked at the
      *end* of a do {} while() loop.  So, if it is passed in as zero, the loop
      will execute once and decrement len to -1.  At that point, the loop will
      proceed until the next invalid address is found; in the process, it will
      likely overflow the pages array passed in to get_user_pages().
      
      I think that, if get_user_pages() has been asked to grab zero pages,
      that's what it should do.  Thus this patch; it is, among other things,
      enough to block the (already fixed) root exploit and any others which
      might be lurking in similar code.  I also think that the number of pages
      should be unsigned, but changing the prototype of this function probably
      requires some more careful review.
      Signed-off-by: default avatarJonathan Corbet <corbet@lwn.net>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      900cf086
  13. 08 Feb, 2008 1 commit
    • Martin Schwidefsky's avatar
      CONFIG_HIGHPTE vs. sub-page page tables. · 2f569afd
      Martin Schwidefsky authored
      
      
      Background: I've implemented 1K/2K page tables for s390.  These sub-page
      page tables are required to properly support the s390 virtualization
      instruction with KVM.  The SIE instruction requires that the page tables
      have 256 page table entries (pte) followed by 256 page status table entries
      (pgste).  The pgstes are only required if the process is using the SIE
      instruction.  The pgstes are updated by the hardware and by the hypervisor
      for a number of reasons, one of them is dirty and reference bit tracking.
      To avoid wasting memory the standard pte table allocation should return
      1K/2K (31/64 bit) and 2K/4K if the process is using SIE.
      
      Problem: Page size on s390 is 4K, page table size is 1K or 2K.  That means
      the s390 version for pte_alloc_one cannot return a pointer to a struct
      page.  Trouble is that with the CONFIG_HIGHPTE feature on x86 pte_alloc_one
      cannot return a pointer to a pte either, since that would require more than
      32 bit for the return value of pte_alloc_one (and the pte * would not be
      accessible since its not kmapped).
      
      Solution: The only solution I found to this dilemma is a new typedef: a
      pgtable_t.  For s390 pgtable_t will be a (pte *) - to be introduced with a
      later patch.  For everybody else it will be a (struct page *).  The
      additional problem with the initialization of the ptl lock and the
      NR_PAGETABLE accounting is solved with a constructor pgtable_page_ctor and
      a destructor pgtable_page_dtor.  The page table allocation and free
      functions need to call these two whenever a page table page is allocated or
      freed.  pmd_populate will get a pgtable_t instead of a struct page pointer.
       To get the pgtable_t back from a pmd entry that has been installed with
      pmd_populate a new function pmd_pgtable is added.  It replaces the pmd_page
      call in free_pte_range and apply_to_pte_range.
      Signed-off-by: default avatarMartin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2f569afd
  14. 07 Feb, 2008 2 commits
    • Balbir Singh's avatar
      Memory controller: make charging gfp mask aware · e1a1cd59
      Balbir Singh authored
      
      
      Nick Piggin pointed out that swap cache and page cache addition routines
      could be called from non GFP_KERNEL contexts.  This patch makes the
      charging routine aware of the gfp context.  Charging might fail if the
      cgroup is over it's limit, in which case a suitable error is returned.
      
      This patch was tested on a Powerpc box.  I am still looking at being able
      to test the path, through which allocations happen in non GFP_KERNEL
      contexts.
      
      [kamezawa.hiroyu@jp.fujitsu.com: problem with ZONE_MOVABLE]
      Signed-off-by: default avatarBalbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e1a1cd59
    • Balbir Singh's avatar
      Memory controller: memory accounting · 8a9f3ccd
      Balbir Singh authored
      
      
      Add the accounting hooks.  The accounting is carried out for RSS and Page
      Cache (unmapped) pages.  There is now a common limit and accounting for both.
      The RSS accounting is accounted at page_add_*_rmap() and page_remove_rmap()
      time.  Page cache is accounted at add_to_page_cache(),
      __delete_from_page_cache().  Swap cache is also accounted for.
      
      Each page's page_cgroup is protected with the last bit of the
      page_cgroup pointer, this makes handling of race conditions involving
      simultaneous mappings of a page easier.  A reference count is kept in the
      page_cgroup to deal with cases where a page might be unmapped from the RSS
      of all tasks, but still lives in the page cache.
      
      Credits go to Vaidyanathan Srinivasan for helping with reference counting work
      of the page cgroup.  Almost all of the page cache accounting code has help
      from Vaidyanathan Srinivasan.
      
      [hugh@veritas.com: fix swapoff breakage]
      [akpm@linux-foundation.org: fix locking]
      Signed-off-by: default avatarVaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: default avatarBalbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <Valdis.Kletnieks@vt.edu>
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8a9f3ccd
  15. 06 Feb, 2008 1 commit
  16. 05 Feb, 2008 8 commits
    • Nick Piggin's avatar
      mm: fix PageUptodate data race · 0ed361de
      Nick Piggin authored
      
      
      After running SetPageUptodate, preceeding stores to the page contents to
      actually bring it uptodate may not be ordered with the store to set the
      page uptodate.
      
      Therefore, another CPU which checks PageUptodate is true, then reads the
      page contents can get stale data.
      
      Fix this by having an smp_wmb before SetPageUptodate, and smp_rmb after
      PageUptodate.
      
      Many places that test PageUptodate, do so with the page locked, and this
      would be enough to ensure memory ordering in those places if
      SetPageUptodate were only called while the page is locked.  Unfortunately
      that is not always the case for some filesystems, but it could be an idea
      for the future.
      
      Also bring the handling of anonymous page uptodateness in line with that of
      file backed page management, by marking anon pages as uptodate when they
      _are_ uptodate, rather than when our implementation requires that they be
      marked as such.  Doing allows us to get rid of the smp_wmb's in the page
      copying functions, which were especially added for anonymous pages for an
      analogous memory ordering problem.  Both file and anonymous pages are
      handled with the same barriers.
      
      FAQ:
      Q. Why not do this in flush_dcache_page?
      A. Firstly, flush_dcache_page handles only one side (the smb side) of the
      ordering protocol; we'd still need smp_rmb somewhere. Secondly, hiding away
      memory barriers in a completely unrelated function is nasty; at least in the
      PageUptodate macros, they are located together with (half) the operations
      involved in the ordering. Thirdly, the smp_wmb is only required when first
      bringing the page uptodate, wheras flush_dcache_page should be called each time
      it is written to through the kernel mapping. It is logically the wrong place to
      put it.
      
      Q. Why does this increase my text size / reduce my performance / etc.
      A. Because it is adding the necessary instructions to eliminate the data-race.
      
      Q. Can it be improved?
      A. Yes, eg. if you were to create a rule that all SetPageUptodate operations
      run under the page lock, we could avoid the smp_rmb places where PageUptodate
      is queried under the page lock. Requires audit of all filesystems and at least
      some would need reworking. That's great you're interested, I'm eagerly awaiting
      your patches.
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0ed361de
    • Harvey Harrison's avatar
      mm: remove fastcall from mm/ · 920c7a5d
      Harvey Harrison authored
      
      
      fastcall is always defined to be empty, remove it
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarHarvey Harrison <harvey.harrison@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      920c7a5d
    • Benjamin Herrenschmidt's avatar
      add mm argument to pte/pmd/pud/pgd_free · 5e541973
      Benjamin Herrenschmidt authored
      
      
      (with Martin Schwidefsky <schwidefsky@de.ibm.com>)
      
      The pgd/pud/pmd/pte page table allocation functions get a mm_struct pointer as
      first argument.  The free functions do not get the mm_struct argument.  This
      is 1) asymmetrical and 2) to do mm related page table allocations the mm
      argument is needed on the free function as well.
      
      [kamalesh@linux.vnet.ibm.com: i386 fix]
      [akpm@linux-foundation.org: coding-syle fixes]
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: default avatarMartin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: default avatarKamalesh Babulal <kamalesh@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5e541973
    • Christoph Hellwig's avatar
      clean up vmtruncate · 61d5048f
      Christoph Hellwig authored
      
      
      vmtruncate is a twisted maze of gotos, this patch cleans it up to have a
      proper if else for the two major cases of extending and truncating truncate
      and thus makes it a lot more readable while keeping exactly the same
      functinality.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      61d5048f
    • Hugh Dickins's avatar
      swapin needs gfp_mask for loop on tmpfs · 02098fea
      Hugh Dickins authored
      
      
      Building in a filesystem on a loop device on a tmpfs file can hang when
      swapping, the loop thread caught in that infamous throttle_vm_writeout.
      
      In theory this is a long standing problem, which I've either never seen in
      practice, or long ago suppressed the recollection, after discounting my load
      and my tmpfs size as unrealistically high.  But now, with the new aops, it has
      become easy to hang on one machine.
      
      Loop used to grab_cache_page before the old prepare_write to tmpfs, which
      seems to have been enough to free up some memory for any swapin needed; but
      the new write_begin lets tmpfs find or allocate the page (much nicer, since
      grab_cache_page missed tmpfs pages in swapcache).
      
      When allocating a fresh page, tmpfs respects loop's mapping_gfp_mask, which
      has __GFP_IO|__GFP_FS stripped off, and throttle_vm_writeout is designed to
      break out when __GFP_IO or GFP_FS is unset; but when tmfps swaps in,
      read_swap_cache_async allocates with GFP_HIGHUSER_MOVABLE regardless of the
      mapping_gfp_mask - hence the hang.
      
      So, pass gfp_mask down the line from shmem_getpage to shmem_swapin to
      swapin_readahead to read_swap_cache_async to add_to_swap_cache.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      02098fea
    • Hugh Dickins's avatar
      swapin_readahead: move and rearrange args · 46017e95
      Hugh Dickins authored
      
      
      swapin_readahead has never sat well in mm/memory.c: move it to mm/swap_state.c
      beside its kindred read_swap_cache_async.  Why were its args in a different
      order?  rearrange them.  And since it was always followed by a
      read_swap_cache_async of the target page, fold that in and return struct
      page*.  Then CONFIG_SWAP=n no longer needs valid_swaphandles and
      read_swap_cache_async stubs.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      46017e95
    • Hugh Dickins's avatar
      swapin_readahead: excise NUMA bogosity · c4cc6d07
      Hugh Dickins authored
      
      
      For three years swapin_readahead has been cluttered with fanciful CONFIG_NUMA
      code, advancing addr, and stepping on to the next vma at the boundary, to line
      up the mempolicy for each page allocation.
      
      It _might_ be a good idea to allocate swap more according to vma layout; but
      the fact is, that's not how we do it at all, 2.6 even less than 2.4: swap is
      allocated as needed for pages as they sink to the bottom of the inactive LRUs.
       Sometimes that may match vma layout, but not so often that it's worth going
      to these misleading vma->vm_next lengths: rip all that out.
      
      Originally I intended to retain the incrementation of addr, but correct its
      initial value: valid_swaphandles generally supplies an offset below the target
      addr (this is readaround rather than readahead), but addr has not been
      adjusted accordingly, so in the interleave case it has usually been allocating
      the target page from the "wrong" node (though that may not matter very much).
      
      But look at the equivalent shmem_swapin code: either by oversight or by
      design, though it has all the apparatus for choosing a new mempolicy per page,
      it uses the same idx throughout, choosing the same mempolicy and interleave
      node for each page of the cluster.
      
      Which is actually a much better strategy: each node has its own LRUs and its
      own kswapd, so if you're betting on any particular relationship between swap
      and node, the best bet is that nearby swap entries belong to pages from the
      same node - even when the mempolicy of the target page is to interleave.  And
      examining a map of nodes corresponding to swap entries on a numa=fake system
      bears this out.  (We could later tweak swap allocation to make it even more
      likely, but this patch is merely about removing cruft.)
      
      So, neither adjust nor increment addr in swapin_readahead, and then
      shmem_swapin can use it too; the pseudo-vma to pass policy need only be set up
      once per cluster, and so few fields of pvma are used, let's skip the memset -
      from shmem_alloc_page also.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c4cc6d07
    • Christoph Lameter's avatar
      Move vmalloc_to_page() to mm/vmalloc. · 48667e7a
      Christoph Lameter authored
      
      
      We already have page table manipulation for vmalloc in vmalloc.c. Move the
      vmalloc_to_page() function there as well.
      
      Move the definitions for vmalloc related functions in mm.h to a newly created
      section.  A better place would be vmalloc.h but mm.h is basic and may depend
      on these functions.  An alternative would be to include vmalloc.h in mm.h
      (like done for vmstat.h).
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      48667e7a
  17. 30 Jan, 2008 2 commits
    • Andi Kleen's avatar
      x86: print which shared library/executable faulted in segfault etc. messages v3 · 03252919
      Andi Kleen authored
      
      
      They now look like:
      
      hal-resmgr[13791]: segfault at 3c rip 2b9c8caec182 rsp 7fff1e825d30 error 4 in libacl.so.1.1.0[2b9c8caea000+6000]
      
      This makes it easier to pinpoint bugs to specific libraries.
      
      And printing the offset into a mapping also always allows to find the
      correct fault point in a library even with randomized mappings. Previously
      there was no way to actually find the correct code address inside
      the randomized mapping.
      
      Relies on earlier patch to shorten the printk formats.
      
      They are often now longer than 80 characters, but I think that's worth it.
      
      [includes fix from Eric Dumazet to check d_path error value]
      Signed-off-by: default avatarAndi Kleen <ak@suse.de>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      03252919
    • Nick Piggin's avatar
      spinlock: lockbreak cleanup · 95c354fe
      Nick Piggin authored
      
      
      The break_lock data structure and code for spinlocks is quite nasty.
      Not only does it double the size of a spinlock but it changes locking to
      a potentially less optimal trylock.
      
      Put all of that under CONFIG_GENERIC_LOCKBREAK, and introduce a
      __raw_spin_is_contended that uses the lock data itself to determine whether
      there are waiters on the lock, to be used if CONFIG_GENERIC_LOCKBREAK is
      not set.
      
      Rename need_lockbreak to spin_needbreak, make it use spin_is_contended to
      decouple it from the spinlock implementation, and make it typesafe (rwlocks
      do not have any need_lockbreak sites -- why do they even get bloated up
      with that break_lock then?).
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      95c354fe
  18. 23 Jan, 2008 1 commit
  19. 17 Jan, 2008 1 commit
    • Carsten Otte's avatar
      #ifdef very expensive debug check in page fault path · 9723198c
      Carsten Otte authored
      
      
      This patch puts #ifdef CONFIG_DEBUG_VM around a check in vm_normal_page
      that verifies that a pfn is valid.  This patch increases performance of the
      page fault microbenchmark in lmbench by 13% and overall dbench performance
      by 7% on s390x.  pfn_valid() is an expensive operation on s390 that needs a
      high double digit amount of CPU cycles.  Nick Piggin suggested that
      pfn_valid() involves an array lookup on systems with sparsemem, and
      therefore is an expensive operation there too.
      
      The check looks like a clear debug thing to me, it should never trigger on
      regular kernels.  And if a pte is created for an invalid pfn, we'll find
      out once the memory gets accessed later on anyway.  Please consider
      inclusion of this patch into mm.
      Signed-off-by: default avatarCarsten Otte <cotte@de.ibm.com>
      Acked-by: default avatarNick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9723198c
  20. 15 Nov, 2007 1 commit