1. 11 Dec, 2014 2 commits
    • Johannes Weiner's avatar
      mm: page_cgroup: rename file to mm/swap_cgroup.c · 5d1ea48b
      Johannes Weiner authored
      
      
      Now that the external page_cgroup data structure and its lookup is gone,
      the only code remaining in there is swap slot accounting.
      
      Rename it and move the conditional compilation into mm/Makefile.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Acked-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Acked-by: default avatarDavid S. Miller <davem@davemloft.net>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5d1ea48b
    • Johannes Weiner's avatar
      mm: memcontrol: lockless page counters · 3e32cb2e
      Johannes Weiner authored
      
      
      Memory is internally accounted in bytes, using spinlock-protected 64-bit
      counters, even though the smallest accounting delta is a page.  The
      counter interface is also convoluted and does too many things.
      
      Introduce a new lockless word-sized page counter API, then change all
      memory accounting over to it.  The translation from and to bytes then only
      happens when interfacing with userspace.
      
      The removed locking overhead is noticable when scaling beyond the per-cpu
      charge caches - on a 4-socket machine with 144-threads, the following test
      shows the performance differences of 288 memcgs concurrently running a
      page fault benchmark:
      
      vanilla:
      
         18631648.500498      task-clock (msec)         #  140.643 CPUs utilized            ( +-  0.33% )
               1,380,638      context-switches          #    0.074 K/sec                    ( +-  0.75% )
                  24,390      cpu-migrations            #    0.001 K/sec                    ( +-  8.44% )
           1,843,305,768      page-faults               #    0.099 M/sec                    ( +-  0.00% )
      50,134,994,088,218      cycles                    #    2.691 GHz                      ( +-  0.33% )
         <not supported>      stalled-cycles-frontend
         <not supported>      stalled-cycles-backend
       8,049,712,224,651      instructions              #    0.16  insns per cycle          ( +-  0.04% )
       1,586,970,584,979      branches                  #   85.176 M/sec                    ( +-  0.05% )
           1,724,989,949      branch-misses             #    0.11% of all branches          ( +-  0.48% )
      
           132.474343877 seconds time elapsed                                          ( +-  0.21% )
      
      lockless:
      
         12195979.037525      task-clock (msec)         #  133.480 CPUs utilized            ( +-  0.18% )
                 832,850      context-switches          #    0.068 K/sec                    ( +-  0.54% )
                  15,624      cpu-migrations            #    0.001 K/sec                    ( +- 10.17% )
           1,843,304,774      page-faults               #    0.151 M/sec                    ( +-  0.00% )
      32,811,216,801,141      cycles                    #    2.690 GHz                      ( +-  0.18% )
         <not supported>      stalled-cycles-frontend
         <not supported>      stalled-cycles-backend
       9,999,265,091,727      instructions              #    0.30  insns per cycle          ( +-  0.10% )
       2,076,759,325,203      branches                  #  170.282 M/sec                    ( +-  0.12% )
           1,656,917,214      branch-misses             #    0.08% of all branches          ( +-  0.55% )
      
            91.369330729 seconds time elapsed                                          ( +-  0.45% )
      
      On top of improved scalability, this also gets rid of the icky long long
      types in the very heart of memcg, which is great for 32 bit and also makes
      the code a lot more readable.
      
      Notable differences between the old and new API:
      
      - res_counter_charge() and res_counter_charge_nofail() become
        page_counter_try_charge() and page_counter_charge() resp. to match
        the more common kernel naming scheme of try_do()/do()
      
      - res_counter_uncharge_until() is only ever used to cancel a local
        counter and never to uncharge bigger segments of a hierarchy, so
        it's replaced by the simpler page_counter_cancel()
      
      - res_counter_set_limit() is replaced by page_counter_limit(), which
        expects its callers to serialize against themselves
      
      - res_counter_memparse_write_strategy() is replaced by
        page_counter_limit(), which rounds down to the nearest page size -
        rather than up.  This is more reasonable for explicitely requested
        hard upper limits.
      
      - to keep charging light-weight, page_counter_try_charge() charges
        speculatively, only to roll back if the result exceeds the limit.
        Because of this, a failing bigger charge can temporarily lock out
        smaller charges that would otherwise succeed.  The error is bounded
        to the difference between the smallest and the biggest possible
        charge size, so for memcg, this means that a failing THP charge can
        send base page charges into reclaim upto 2MB (4MB) before the limit
        would have been reached.  This should be acceptable.
      
      [akpm@linux-foundation.org: add includes for WARN_ON_ONCE and memparse]
      [akpm@linux-foundation.org: add includes for WARN_ON_ONCE, memparse, strncmp, and PAGE_SIZE]
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Acked-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3e32cb2e
  2. 10 Oct, 2014 3 commits
  3. 18 Aug, 2014 1 commit
    • Josh Triplett's avatar
      mm: Support compiling out madvise and fadvise · d3ac21ca
      Josh Triplett authored
      
      
      Many embedded systems will not need these syscalls, and omitting them
      saves space.  Add a new EXPERT config option CONFIG_ADVISE_SYSCALLS
      (default y) to support compiling them out.
      
      bloat-o-meter:
      add/remove: 0/3 grow/shrink: 0/0 up/down: 0/-2250 (-2250)
      function                                     old     new   delta
      sys_fadvise64                                 57       -     -57
      sys_fadvise64_64                             691       -    -691
      sys_madvise                                 1502       -   -1502
      Signed-off-by: default avatarJosh Triplett <josh@joshtriplett.org>
      d3ac21ca
  4. 07 Aug, 2014 2 commits
  5. 04 Jun, 2014 1 commit
  6. 20 May, 2014 1 commit
  7. 07 Apr, 2014 2 commits
    • Mark Salter's avatar
      mm: create generic early_ioremap() support · 9e5c33d7
      Mark Salter authored
      
      
      This patch creates a generic implementation of early_ioremap() support
      based on the existing x86 implementation.  early_ioremp() is useful for
      early boot code which needs to temporarily map I/O or memory regions
      before normal mapping functions such as ioremap() are available.
      
      Some architectures have optional MMU.  In the no-MMU case, the remap
      functions simply return the passed in physical address and the unmap
      functions do nothing.
      Signed-off-by: default avatarMark Salter <msalter@redhat.com>
      Acked-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Acked-by: default avatarH. Peter Anvin <hpa@zytor.com>
      Cc: Borislav Petkov <borislav.petkov@amd.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9e5c33d7
    • Davidlohr Bueso's avatar
      mm: per-thread vma caching · 615d6e87
      Davidlohr Bueso authored
      This patch is a continuation of efforts trying to optimize find_vma(),
      avoiding potentially expensive rbtree walks to locate a vma upon faults.
      The original approach (https://lkml.org/lkml/2013/11/1/410
      
      ), where the
      largest vma was also cached, ended up being too specific and random,
      thus further comparison with other approaches were needed.  There are
      two things to consider when dealing with this, the cache hit rate and
      the latency of find_vma().  Improving the hit-rate does not necessarily
      translate in finding the vma any faster, as the overhead of any fancy
      caching schemes can be too high to consider.
      
      We currently cache the last used vma for the whole address space, which
      provides a nice optimization, reducing the total cycles in find_vma() by
      up to 250%, for workloads with good locality.  On the other hand, this
      simple scheme is pretty much useless for workloads with poor locality.
      Analyzing ebizzy runs shows that, no matter how many threads are
      running, the mmap_cache hit rate is less than 2%, and in many situations
      below 1%.
      
      The proposed approach is to replace this scheme with a small per-thread
      cache, maximizing hit rates at a very low maintenance cost.
      Invalidations are performed by simply bumping up a 32-bit sequence
      number.  The only expensive operation is in the rare case of a seq
      number overflow, where all caches that share the same address space are
      flushed.  Upon a miss, the proposed replacement policy is based on the
      page number that contains the virtual address in question.  Concretely,
      the following results are seen on an 80 core, 8 socket x86-64 box:
      
      1) System bootup: Most programs are single threaded, so the per-thread
         scheme does improve ~50% hit rate by just adding a few more slots to
         the cache.
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 50.61%   | 19.90            |
      | patched        | 73.45%   | 13.58            |
      +----------------+----------+------------------+
      
      2) Kernel build: This one is already pretty good with the current
         approach as we're dealing with good locality.
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 75.28%   | 11.03            |
      | patched        | 88.09%   | 9.31             |
      +----------------+----------+------------------+
      
      3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 70.66%   | 17.14            |
      | patched        | 91.15%   | 12.57            |
      +----------------+----------+------------------+
      
      4) Ebizzy: There's a fair amount of variation from run to run, but this
         approach always shows nearly perfect hit rates, while baseline is just
         about non-existent.  The amounts of cycles can fluctuate between
         anywhere from ~60 to ~116 for the baseline scheme, but this approach
         reduces it considerably.  For instance, with 80 threads:
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 1.06%    | 91.54            |
      | patched        | 99.97%   | 14.18            |
      +----------------+----------+------------------+
      
      [akpm@linux-foundation.org: fix nommu build, per Davidlohr]
      [akpm@linux-foundation.org: document vmacache_valid() logic]
      [akpm@linux-foundation.org: attempt to untangle header files]
      [akpm@linux-foundation.org: add vmacache_find() BUG_ON]
      [hughd@google.com: add vmacache_valid_mm() (from Oleg)]
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: adjust and enhance comments]
      Signed-off-by: default avatarDavidlohr Bueso <davidlohr@hp.com>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Tested-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      615d6e87
  8. 03 Apr, 2014 1 commit
    • Johannes Weiner's avatar
      mm: thrash detection-based file cache sizing · a528910e
      Johannes Weiner authored
      
      
      The VM maintains cached filesystem pages on two types of lists.  One
      list holds the pages recently faulted into the cache, the other list
      holds pages that have been referenced repeatedly on that first list.
      The idea is to prefer reclaiming young pages over those that have shown
      to benefit from caching in the past.  We call the recently usedbut
      ultimately was not significantly better than a FIFO policy and still
      thrashed cache based on eviction speed, rather than actual demand for
      cache.
      
      This patch solves one half of the problem by decoupling the ability to
      detect working set changes from the inactive list size.  By maintaining
      a history of recently evicted file pages it can detect frequently used
      pages with an arbitrarily small inactive list size, and subsequently
      apply pressure on the active list based on actual demand for cache, not
      just overall eviction speed.
      
      Every zone maintains a counter that tracks inactive list aging speed.
      When a page is evicted, a snapshot of this counter is stored in the
      now-empty page cache radix tree slot.  On refault, the minimum access
      distance of the page can be assessed, to evaluate whether the page
      should be part of the active list or not.
      
      This fixes the VM's blindness towards working set changes in excess of
      the inactive list.  And it's the foundation to further improve the
      protection ability and reduce the minimum inactive list size of 50%.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarMinchan Kim <minchan@kernel.org>
      Reviewed-by: default avatarBob Liu <bob.liu@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a528910e
  9. 02 Apr, 2014 1 commit
  10. 31 Jan, 2014 1 commit
    • Minchan Kim's avatar
      zsmalloc: move it under mm · bcf1647d
      Minchan Kim authored
      
      
      This patch moves zsmalloc under mm directory.
      
      Before that, description will explain why we have needed custom
      allocator.
      
      Zsmalloc is a new slab-based memory allocator for storing compressed
      pages.  It is designed for low fragmentation and high allocation success
      rate on large object, but <= PAGE_SIZE allocations.
      
      zsmalloc differs from the kernel slab allocator in two primary ways to
      achieve these design goals.
      
      zsmalloc never requires high order page allocations to back slabs, or
      "size classes" in zsmalloc terms.  Instead it allows multiple
      single-order pages to be stitched together into a "zspage" which backs
      the slab.  This allows for higher allocation success rate under memory
      pressure.
      
      Also, zsmalloc allows objects to span page boundaries within the zspage.
      This allows for lower fragmentation than could be had with the kernel
      slab allocator for objects between PAGE_SIZE/2 and PAGE_SIZE.  With the
      kernel slab allocator, if a page compresses to 60% of it original size,
      the memory savings gained through compression is lost in fragmentation
      because another object of the same size can't be stored in the leftover
      space.
      
      This ability to span pages results in zsmalloc allocations not being
      directly addressable by the user.  The user is given an
      non-dereferencable handle in response to an allocation request.  That
      handle must be mapped, using zs_map_object(), which returns a pointer to
      the mapped region that can be used.  The mapping is necessary since the
      object data may reside in two different noncontigious pages.
      
      The zsmalloc fulfills the allocation needs for zram perfectly
      
      [sjenning@linux.vnet.ibm.com: borrow Seth's quote]
      Signed-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarNitin Gupta <ngupta@vflare.org>
      Reviewed-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bcf1647d
  11. 10 Sep, 2013 1 commit
    • Dave Chinner's avatar
      list: add a new LRU list type · a38e4082
      Dave Chinner authored
      
      
      Several subsystems use the same construct for LRU lists - a list head, a
      spin lock and and item count.  They also use exactly the same code for
      adding and removing items from the LRU.  Create a generic type for these
      LRU lists.
      
      This is the beginning of generic, node aware LRUs for shrinkers to work
      with.
      
      [glommer@openvz.org: enum defined constants for lru. Suggested by gthelen, don't relock over retry]
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarGlauber Costa <glommer@openvz.org>
      Reviewed-by: default avatarGreg Thelen <gthelen@google.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      a38e4082
  12. 11 Jul, 2013 2 commits
    • Seth Jennings's avatar
      zswap: add to mm/ · 2b281117
      Seth Jennings authored
      
      
      zswap is a thin backend for frontswap that takes pages that are in the
      process of being swapped out and attempts to compress them and store
      them in a RAM-based memory pool.  This can result in a significant I/O
      reduction on the swap device and, in the case where decompressing from
      RAM is faster than reading from the swap device, can also improve
      workload performance.
      
      It also has support for evicting swap pages that are currently
      compressed in zswap to the swap device on an LRU(ish) basis.  This
      functionality makes zswap a true cache in that, once the cache is full,
      the oldest pages can be moved out of zswap to the swap device so newer
      pages can be compressed and stored in zswap.
      
      This patch adds the zswap driver to mm/
      Signed-off-by: default avatarSeth Jennings <sjenning@linux.vnet.ibm.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
      Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
      Cc: Jenifer Hopper <jhopper@us.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Joe Perches <joe@perches.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Hugh Dickens <hughd@google.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2b281117
    • Seth Jennings's avatar
      zbud: add to mm/ · 4e2e2770
      Seth Jennings authored
      
      
      zbud is an special purpose allocator for storing compressed pages.  It
      is designed to store up to two compressed pages per physical page.
      While this design limits storage density, it has simple and
      deterministic reclaim properties that make it preferable to a higher
      density approach when reclaim will be used.
      
      zbud works by storing compressed pages, or "zpages", together in pairs
      in a single memory page called a "zbud page".  The first buddy is "left
      justifed" at the beginning of the zbud page, and the last buddy is
      "right justified" at the end of the zbud page.  The benefit is that if
      either buddy is freed, the freed buddy space, coalesced with whatever
      slack space that existed between the buddies, results in the largest
      possible free region within the zbud page.
      
      zbud also provides an attractive lower bound on density.  The ratio of
      zpages to zbud pages can not be less than 1.  This ensures that zbud can
      never "do harm" by using more pages to store zpages than the
      uncompressed zpages would have used on their own.
      
      This implementation is a rewrite of the zbud allocator internally used
      by zcache in the driver/staging tree.  The rewrite was necessary to
      remove some of the zcache specific elements that were ingrained
      throughout and provide a generic allocation interface that can later be
      used by zsmalloc and others.
      
      This patch adds zbud to mm/ for later use by zswap.
      Signed-off-by: default avatarSeth Jennings <sjenning@linux.vnet.ibm.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
      Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
      Cc: Jenifer Hopper <jhopper@us.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Joe Perches <joe@perches.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Hugh Dickens <hughd@google.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4e2e2770
  13. 29 Apr, 2013 1 commit
    • Anton Vorontsov's avatar
      memcg: add memory.pressure_level events · 70ddf637
      Anton Vorontsov authored
      With this patch userland applications that want to maintain the
      interactivity/memory allocation cost can use the pressure level
      notifications.  The levels are defined like this:
      
      The "low" level means that the system is reclaiming memory for new
      allocations.  Monitoring this reclaiming activity might be useful for
      maintaining cache level.  Upon notification, the program (typically
      "Activity Manager") might analyze vmstat and act in advance (i.e.
      prematurely shutdown unimportant services).
      
      The "medium" level means that the system is experiencing medium memory
      pressure, the system might be making swap, paging out active file
      caches, etc.  Upon this event applications may decide to further analyze
      vmstat/zoneinfo/memcg or internal memory usage statistics and free any
      resources that can be easily reconstructed or re-read from a disk.
      
      The "critical" level means that the system is actively thrashing, it is
      about to out of memory (OOM) or even the in-kernel OOM killer is on its
      way to trigger.  Applications should do whatever they can to help the
      system.  It might be too late to consult with vmstat or any other
      statistics, so it's advisable to take an immediate action.
      
      The events are propagated upward until the event is handled, i.e.  the
      events are not pass-through.  Here is what this means: for example you
      have three cgroups: A->B->C.  Now you set up an event listener on
      cgroups A, B and C, and suppose group C experiences some pressure.  In
      this situation, only group C will receive the notification, i.e.  groups
      A and B will not receive it.  This is done to avoid excessive
      "broadcasting" of messages, which disturbs the system and which is
      especially bad if we are low on memory or thrashing.  So, organize the
      cgroups wisely, or propagate the events manually (or, ask us to
      implement the pass-through events, explaining why would you need them.)
      
      Performance wise, the memory pressure notifications feature itself is
      lightweight and does not require much of bookkeeping, in contrast to the
      rest of memcg features.  Unfortunately, as of current memcg
      implementation, pages accounting is an inseparable part and cannot be
      turned off.  The good news is that there are some efforts[1] to improve
      the situation; plus, implementing the same, fully API-compatible[2]
      interface for CONFIG_MEMCG=n case (e.g.  embedded) is also a viable
      option, so it will not require any changes on the userland side.
      
      [1] http://permalink.gmane.org/gmane.linux.kernel.cgroups/6291
      [2] http://lkml.org/lkml/2013/2/21/454
      
      
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix CONFIG_CGROPUPS=n warnings]
      Signed-off-by: default avatarAnton Vorontsov <anton.vorontsov@linaro.org>
      Acked-by: default avatarKirill A. Shutemov <kirill@shutemov.name>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Leonid Moiseichuk <leonid.moiseichuk@nokia.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      70ddf637
  14. 12 Dec, 2012 1 commit
    • Rafael Aquini's avatar
      mm: introduce a common interface for balloon pages mobility · 18468d93
      Rafael Aquini authored
      
      
      Memory fragmentation introduced by ballooning might reduce significantly
      the number of 2MB contiguous memory blocks that can be used within a guest,
      thus imposing performance penalties associated with the reduced number of
      transparent huge pages that could be used by the guest workload.
      
      This patch introduces a common interface to help a balloon driver on
      making its page set movable to compaction, and thus allowing the system
      to better leverage the compation efforts on memory defragmentation.
      
      [akpm@linux-foundation.org: use PAGE_FLAGS_CHECK_AT_PREP, s/__balloon_page_flags/page_flags_cleared/, small cleanups]
      [rientjes@google.com: allow balloon compaction for any system with memory compaction enabled, which is the defconfig]
      Signed-off-by: default avatarRafael Aquini <aquini@redhat.com>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      18468d93
  15. 09 Oct, 2012 1 commit
    • Michel Lespinasse's avatar
      mm: replace vma prio_tree with an interval tree · 6b2dbba8
      Michel Lespinasse authored
      
      
      Implement an interval tree as a replacement for the VMA prio_tree.  The
      algorithms are similar to lib/interval_tree.c; however that code can't be
      directly reused as the interval endpoints are not explicitly stored in the
      VMA.  So instead, the common algorithm is moved into a template and the
      details (node type, how to get interval endpoints from the node, etc) are
      filled in using the C preprocessor.
      
      Once the interval tree functions are available, using them as a
      replacement to the VMA prio tree is a relatively simple, mechanical job.
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6b2dbba8
  16. 01 Aug, 2012 3 commits
  17. 09 Jul, 2012 1 commit
  18. 29 May, 2012 2 commits
  19. 21 May, 2012 1 commit
  20. 15 May, 2012 1 commit
  21. 01 Nov, 2011 1 commit
    • Christopher Yeoh's avatar
      Cross Memory Attach · fcf63409
      Christopher Yeoh authored
      The basic idea behind cross memory attach is to allow MPI programs doing
      intra-node communication to do a single copy of the message rather than a
      double copy of the message via shared memory.
      
      The following patch attempts to achieve this by allowing a destination
      process, given an address and size from a source process, to copy memory
      directly from the source process into its own address space via a system
      call.  There is also a symmetrical ability to copy from the current
      process's address space into a destination process's address space.
      
      - Use of /proc/pid/mem has been considered, but there are issues with
        using it:
        - Does not allow for specifying iovecs for both src and dest, assuming
          preadv or pwritev was implemented either the area read from or
        written to would need to be contiguous.
        - Currently mem_read allows only processes who are currently
        ptrace'ing the target and are still able to ptrace the target to read
        from the target. This check could possibly be moved to the open call,
        but its not clear exactly what race this restriction is stopping
        (reason  appears to have been lost)
        - Having to send the fd of /proc/self/mem via SCM_RIGHTS on unix
        domain socket is a bit ugly from a userspace point of view,
        especially when you may have hundreds if not (eventually) thousands
        of processes  that all need to do this with each other
        - Doesn't allow for some future use of the interface we would like to
        consider adding in the future (see below)
        - Interestingly reading from /proc/pid/mem currently actually
        involves two copies! (But this could be fixed pretty easily)
      
      As mentioned previously use of vmsplice instead was considered, but has
      problems.  Since you need the reader and writer working co-operatively if
      the pipe is not drained then you block.  Which requires some wrapping to
      do non blocking on the send side or polling on the receive.  In all to all
      communication it requires ordering otherwise you can deadlock.  And in the
      example of many MPI tasks writing to one MPI task vmsplice serialises the
      copying.
      
      There are some cases of MPI collectives where even a single copy interface
      does not get us the performance gain we could.  For example in an
      MPI_Reduce rather than copy the data from the source we would like to
      instead use it directly in a mathops (say the reduce is doing a sum) as
      this would save us doing a copy.  We don't need to keep a copy of the data
      from the source.  I haven't implemented this, but I think this interface
      could in the future do all this through the use of the flags - eg could
      specify the math operation and type and the kernel rather than just
      copying the data would apply the specified operation between the source
      and destination and store it in the destination.
      
      Although we don't have a "second user" of the interface (though I've had
      some nibbles from people who may be interested in using it for intra
      process messaging which is not MPI).  This interface is something which
      hardware vendors are already doing for their custom drivers to implement
      fast local communication.  And so in addition to this being useful for
      OpenMPI it would mean the driver maintainers don't have to fix things up
      when the mm changes.
      
      There was some discussion about how much faster a true zero copy would
      go. Here's a link back to the email with some testing I did on that:
      
      http://marc.info/?l=linux-mm&m=130105930902915&w=2
      
      There is a basic man page for the proposed interface here:
      
      http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt
      
      This has been implemented for x86 and powerpc, other architecture should
      mainly (I think) just need to add syscall numbers for the process_vm_readv
      and process_vm_writev. There are 32 bit compatibility versions for
      64-bit kernels.
      
      For arch maintainers there are some simple tests to be able to quickly
      verify that the syscalls are working correctly here:
      
      http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgz
      
      Signed-off-by: default avatarChris Yeoh <yeohc@au1.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: <linux-man@vger.kernel.org>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fcf63409
  22. 26 May, 2011 1 commit
    • Dan Magenheimer's avatar
      mm: cleancache core ops functions and config · 077b1f83
      Dan Magenheimer authored
      
      
      This third patch of eight in this cleancache series provides
      the core code for cleancache that interfaces between the hooks in
      VFS and individual filesystems and a cleancache backend.  It also
      includes build and config patches.
      
      Two new files are added: mm/cleancache.c and include/linux/cleancache.h.
      
      Note that CONFIG_CLEANCACHE can default to on; in systems that do
      not provide a cleancache backend, all hooks devolve to a simple
      check of a global enable flag, so performance impact should
      be negligible but can be reduced to zero impact if config'ed off.
      However for this first commit, it defaults to off.
      
      Details and a FAQ can be found in Documentation/vm/cleancache.txt
      
      Credits: Cleancache_ops design derived from Jeremy Fitzhardinge
      design for tmem
      
      [v8: dan.magenheimer@oracle.com: fix exportfs call affecting btrfs]
      [v8: akpm@linux-foundation.org: use static inline function, not macro]
      [v7: dan.magenheimer@oracle.com: cleanup sysfs and remove cleancache prefix]
      [v6: JBeulich@novell.com: robustly handle buggy fs encode_fh actor definition]
      [v5: jeremy@goop.org: clean up global usage and static var names]
      [v5: jeremy@goop.org: simplify init hook and any future fs init changes]
      [v5: hch@infradead.org: cleaner non-global interface for ops registration]
      [v4: adilger@sun.com: interface must support exportfs FS's]
      [v4: hch@infradead.org: interface must support 64-bit FS on 32-bit kernel]
      [v3: akpm@linux-foundation.org: use one ops struct to avoid pointer hops]
      [v3: akpm@linux-foundation.org: document and ensure PageLocked reqts are met]
      [v3: ngupta@vflare.org: fix success/fail codes, change funcs to void]
      [v2: viro@ZenIV.linux.org.uk: use sane types]
      Signed-off-by: default avatarDan Magenheimer <dan.magenheimer@oracle.com>
      Reviewed-by: default avatarJeremy Fitzhardinge <jeremy@goop.org>
      Reviewed-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Acked-by: default avatarAl Viro <viro@ZenIV.linux.org.uk>
      Acked-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarNitin Gupta <ngupta@vflare.org>
      Acked-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Acked-by: default avatarAndreas Dilger <adilger@sun.com>
      Acked-by: default avatarJan Beulich <JBeulich@novell.com>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik Van Riel <riel@redhat.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Ted Ts'o <tytso@mit.edu>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <joel.becker@oracle.com>
      077b1f83
  23. 24 Feb, 2011 1 commit
    • Yinghai Lu's avatar
      bootmem: Separate out CONFIG_NO_BOOTMEM code into nobootmem.c · 09325873
      Yinghai Lu authored
      
      
      mm/bootmem.c contained code paths for both bootmem and no bootmem
      configurations.  They implement about the same set of APIs in
      different ways and as a result bootmem.c contains massive amount of
      #ifdef CONFIG_NO_BOOTMEM.
      
      Separate out CONFIG_NO_BOOTMEM code into mm/nobootmem.c.  As the
      common part is relatively small, duplicate them in nobootmem.c instead
      of creating a common file or ifdef'ing in bootmem.c.
      
      The followings are duplicated.
      
      * {min|max}_low_pfn, max_pfn, saved_max_pfn
      * free_bootmem_late()
      * ___alloc_bootmem()
      * __alloc_bootmem_low()
      
      The followings are applicable only to nobootmem and moved verbatim.
      
      * __free_pages_memory()
      * free_all_memory_core_early()
      
      The followings are not applicable to nobootmem and omitted in
      nobootmem.c.
      
      * reserve_bootmem_node()
      * reserve_bootmem()
      
      The rest split function bodies according to CONFIG_NO_BOOTMEM.
      
      Makefile is updated so that only either bootmem.c or nobootmem.c is
      built according to CONFIG_NO_BOOTMEM.
      
      This patch doesn't introduce any behavior change.
      
      -tj: Rewrote commit description.
      Suggested-by: default avatarIngo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarYinghai Lu <yinghai@kernel.org>
      Acked-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      09325873
  24. 14 Jan, 2011 2 commits
    • Andrea Arcangeli's avatar
      thp: transparent hugepage core · 71e3aac0
      Andrea Arcangeli authored
      
      
      Lately I've been working to make KVM use hugepages transparently without
      the usual restrictions of hugetlbfs.  Some of the restrictions I'd like to
      see removed:
      
      1) hugepages have to be swappable or the guest physical memory remains
         locked in RAM and can't be paged out to swap
      
      2) if a hugepage allocation fails, regular pages should be allocated
         instead and mixed in the same vma without any failure and without
         userland noticing
      
      3) if some task quits and more hugepages become available in the
         buddy, guest physical memory backed by regular pages should be
         relocated on hugepages automatically in regions under
         madvise(MADV_HUGEPAGE) (ideally event driven by waking up the
         kernel deamon if the order=HPAGE_PMD_SHIFT-PAGE_SHIFT list becomes
         not null)
      
      4) avoidance of reservation and maximization of use of hugepages whenever
         possible. Reservation (needed to avoid runtime fatal faliures) may be ok for
         1 machine with 1 database with 1 database cache with 1 database cache size
         known at boot time. It's definitely not feasible with a virtualization
         hypervisor usage like RHEV-H that runs an unknown number of virtual machines
         with an unknown size of each virtual machine with an unknown amount of
         pagecache that could be potentially useful in the host for guest not using
         O_DIRECT (aka cache=off).
      
      hugepages in the virtualization hypervisor (and also in the guest!) are
      much more important than in a regular host not using virtualization,
      becasue with NPT/EPT they decrease the tlb-miss cacheline accesses from 24
      to 19 in case only the hypervisor uses transparent hugepages, and they
      decrease the tlb-miss cacheline accesses from 19 to 15 in case both the
      linux hypervisor and the linux guest both uses this patch (though the
      guest will limit the addition speedup to anonymous regions only for
      now...).  Even more important is that the tlb miss handler is much slower
      on a NPT/EPT guest than for a regular shadow paging or no-virtualization
      scenario.  So maximizing the amount of virtual memory cached by the TLB
      pays off significantly more with NPT/EPT than without (even if there would
      be no significant speedup in the tlb-miss runtime).
      
      The first (and more tedious) part of this work requires allowing the VM to
      handle anonymous hugepages mixed with regular pages transparently on
      regular anonymous vmas.  This is what this patch tries to achieve in the
      least intrusive possible way.  We want hugepages and hugetlb to be used in
      a way so that all applications can benefit without changes (as usual we
      leverage the KVM virtualization design: by improving the Linux VM at
      large, KVM gets the performance boost too).
      
      The most important design choice is: always fallback to 4k allocation if
      the hugepage allocation fails!  This is the _very_ opposite of some large
      pagecache patches that failed with -EIO back then if a 64k (or similar)
      allocation failed...
      
      Second important decision (to reduce the impact of the feature on the
      existing pagetable handling code) is that at any time we can split an
      hugepage into 512 regular pages and it has to be done with an operation
      that can't fail.  This way the reliability of the swapping isn't decreased
      (no need to allocate memory when we are short on memory to swap) and it's
      trivial to plug a split_huge_page* one-liner where needed without
      polluting the VM.  Over time we can teach mprotect, mremap and friends to
      handle pmd_trans_huge natively without calling split_huge_page*.  The fact
      it can't fail isn't just for swap: if split_huge_page would return -ENOMEM
      (instead of the current void) we'd need to rollback the mprotect from the
      middle of it (ideally including undoing the split_vma) which would be a
      big change and in the very wrong direction (it'd likely be simpler not to
      call split_huge_page at all and to teach mprotect and friends to handle
      hugepages instead of rolling them back from the middle).  In short the
      very value of split_huge_page is that it can't fail.
      
      The collapsing and madvise(MADV_HUGEPAGE) part will remain separated and
      incremental and it'll just be an "harmless" addition later if this initial
      part is agreed upon.  It also should be noted that locking-wise replacing
      regular pages with hugepages is going to be very easy if compared to what
      I'm doing below in split_huge_page, as it will only happen when
      page_count(page) matches page_mapcount(page) if we can take the PG_lock
      and mmap_sem in write mode.  collapse_huge_page will be a "best effort"
      that (unlike split_huge_page) can fail at the minimal sign of trouble and
      we can try again later.  collapse_huge_page will be similar to how KSM
      works and the madvise(MADV_HUGEPAGE) will work similar to
      madvise(MADV_MERGEABLE).
      
      The default I like is that transparent hugepages are used at page fault
      time.  This can be changed with
      /sys/kernel/mm/transparent_hugepage/enabled.  The control knob can be set
      to three values "always", "madvise", "never" which mean respectively that
      hugepages are always used, or only inside madvise(MADV_HUGEPAGE) regions,
      or never used.  /sys/kernel/mm/transparent_hugepage/defrag instead
      controls if the hugepage allocation should defrag memory aggressively
      "always", only inside "madvise" regions, or "never".
      
      The pmd_trans_splitting/pmd_trans_huge locking is very solid.  The
      put_page (from get_user_page users that can't use mmu notifier like
      O_DIRECT) that runs against a __split_huge_page_refcount instead was a
      pain to serialize in a way that would result always in a coherent page
      count for both tail and head.  I think my locking solution with a
      compound_lock taken only after the page_first is valid and is still a
      PageHead should be safe but it surely needs review from SMP race point of
      view.  In short there is no current existing way to serialize the O_DIRECT
      final put_page against split_huge_page_refcount so I had to invent a new
      one (O_DIRECT loses knowledge on the mapping status by the time gup_fast
      returns so...).  And I didn't want to impact all gup/gup_fast users for
      now, maybe if we change the gup interface substantially we can avoid this
      locking, I admit I didn't think too much about it because changing the gup
      unpinning interface would be invasive.
      
      If we ignored O_DIRECT we could stick to the existing compound refcounting
      code, by simply adding a get_user_pages_fast_flags(foll_flags) where KVM
      (and any other mmu notifier user) would call it without FOLL_GET (and if
      FOLL_GET isn't set we'd just BUG_ON if nobody registered itself in the
      current task mmu notifier list yet).  But O_DIRECT is fundamental for
      decent performance of virtualized I/O on fast storage so we can't avoid it
      to solve the race of put_page against split_huge_page_refcount to achieve
      a complete hugepage feature for KVM.
      
      Swap and oom works fine (well just like with regular pages ;).  MMU
      notifier is handled transparently too, with the exception of the young bit
      on the pmd, that didn't have a range check but I think KVM will be fine
      because the whole point of hugepages is that EPT/NPT will also use a huge
      pmd when they notice gup returns pages with PageCompound set, so they
      won't care of a range and there's just the pmd young bit to check in that
      case.
      
      NOTE: in some cases if the L2 cache is small, this may slowdown and waste
      memory during COWs because 4M of memory are accessed in a single fault
      instead of 8k (the payoff is that after COW the program can run faster).
      So we might want to switch the copy_huge_page (and clear_huge_page too) to
      not temporal stores.  I also extensively researched ways to avoid this
      cache trashing with a full prefault logic that would cow in 8k/16k/32k/64k
      up to 1M (I can send those patches that fully implemented prefault) but I
      concluded they're not worth it and they add an huge additional complexity
      and they remove all tlb benefits until the full hugepage has been faulted
      in, to save a little bit of memory and some cache during app startup, but
      they still don't improve substantially the cache-trashing during startup
      if the prefault happens in >4k chunks.  One reason is that those 4k pte
      entries copied are still mapped on a perfectly cache-colored hugepage, so
      the trashing is the worst one can generate in those copies (cow of 4k page
      copies aren't so well colored so they trashes less, but again this results
      in software running faster after the page fault).  Those prefault patches
      allowed things like a pte where post-cow pages were local 4k regular anon
      pages and the not-yet-cowed pte entries were pointing in the middle of
      some hugepage mapped read-only.  If it doesn't payoff substantially with
      todays hardware it will payoff even less in the future with larger l2
      caches, and the prefault logic would blot the VM a lot.  If one is
      emebdded transparent_hugepage can be disabled during boot with sysfs or
      with the boot commandline parameter transparent_hugepage=0 (or
      transparent_hugepage=2 to restrict hugepages inside madvise regions) that
      will ensure not a single hugepage is allocated at boot time.  It is simple
      enough to just disable transparent hugepage globally and let transparent
      hugepages be allocated selectively by applications in the MADV_HUGEPAGE
      region (both at page fault time, and if enabled with the
      collapse_huge_page too through the kernel daemon).
      
      This patch supports only hugepages mapped in the pmd, archs that have
      smaller hugepages will not fit in this patch alone.  Also some archs like
      power have certain tlb limits that prevents mixing different page size in
      the same regions so they will not fit in this framework that requires
      "graceful fallback" to basic PAGE_SIZE in case of physical memory
      fragmentation.  hugetlbfs remains a perfect fit for those because its
      software limits happen to match the hardware limits.  hugetlbfs also
      remains a perfect fit for hugepage sizes like 1GByte that cannot be hoped
      to be found not fragmented after a certain system uptime and that would be
      very expensive to defragment with relocation, so requiring reservation.
      hugetlbfs is the "reservation way", the point of transparent hugepages is
      not to have any reservation at all and maximizing the use of cache and
      hugepages at all times automatically.
      
      Some performance result:
      
      vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largep
      ages3
      memset page fault 1566023
      memset tlb miss 453854
      memset second tlb miss 453321
      random access tlb miss 41635
      random access second tlb miss 41658
      vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
      memset page fault 1566471
      memset tlb miss 453375
      memset second tlb miss 453320
      random access tlb miss 41636
      random access second tlb miss 41637
      vmx andrea # ./largepages3
      memset page fault 1566642
      memset tlb miss 453417
      memset second tlb miss 453313
      random access tlb miss 41630
      random access second tlb miss 41647
      vmx andrea # ./largepages3
      memset page fault 1566872
      memset tlb miss 453418
      memset second tlb miss 453315
      random access tlb miss 41618
      random access second tlb miss 41659
      vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
      vmx andrea # ./largepages3
      memset page fault 2182476
      memset tlb miss 460305
      memset second tlb miss 460179
      random access tlb miss 44483
      random access second tlb miss 44186
      vmx andrea # ./largepages3
      memset page fault 2182791
      memset tlb miss 460742
      memset second tlb miss 459962
      random access tlb miss 43981
      random access second tlb miss 43988
      
      ============
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <sys/time.h>
      
      #define SIZE (3UL*1024*1024*1024)
      
      int main()
      {
      	char *p = malloc(SIZE), *p2;
      	struct timeval before, after;
      
      	gettimeofday(&before, NULL);
      	memset(p, 0, SIZE);
      	gettimeofday(&after, NULL);
      	printf("memset page fault %Lu\n",
      	       (after.tv_sec-before.tv_sec)*1000000UL +
      	       after.tv_usec-before.tv_usec);
      
      	gettimeofday(&before, NULL);
      	memset(p, 0, SIZE);
      	gettimeofday(&after, NULL);
      	printf("memset tlb miss %Lu\n",
      	       (after.tv_sec-before.tv_sec)*1000000UL +
      	       after.tv_usec-before.tv_usec);
      
      	gettimeofday(&before, NULL);
      	memset(p, 0, SIZE);
      	gettimeofday(&after, NULL);
      	printf("memset second tlb miss %Lu\n",
      	       (after.tv_sec-before.tv_sec)*1000000UL +
      	       after.tv_usec-before.tv_usec);
      
      	gettimeofday(&before, NULL);
      	for (p2 = p; p2 < p+SIZE; p2 += 4096)
      		*p2 = 0;
      	gettimeofday(&after, NULL);
      	printf("random access tlb miss %Lu\n",
      	       (after.tv_sec-before.tv_sec)*1000000UL +
      	       after.tv_usec-before.tv_usec);
      
      	gettimeofday(&before, NULL);
      	for (p2 = p; p2 < p+SIZE; p2 += 4096)
      		*p2 = 0;
      	gettimeofday(&after, NULL);
      	printf("random access second tlb miss %Lu\n",
      	       (after.tv_sec-before.tv_sec)*1000000UL +
      	       after.tv_usec-before.tv_usec);
      
      	return 0;
      }
      ============
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      71e3aac0
    • Andrea Arcangeli's avatar
      thp: add pmd mangling generic functions · e2cda322
      Andrea Arcangeli authored
      
      
      Some are needed to build but not actually used on archs not supporting
      transparent hugepages.  Others like pmdp_clear_flush are used by x86 too.
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e2cda322
  25. 02 Oct, 2010 1 commit
    • Tejun Heo's avatar
      percpu: use percpu allocator on UP too · 9b8327bb
      Tejun Heo authored
      
      
      On UP, percpu allocations were redirected to kmalloc.  This has the
      following problems.
      
      * For certain amount of allocations (determined by
        PERCPU_DYNAMIC_EARLY_SLOTS and PERCPU_DYNAMIC_EARLY_SIZE), percpu
        allocator can be used before the usual kernel memory allocator is
        brought online.  On SMP, this is used to initialize the kernel
        memory allocator.
      
      * percpu allocator honors alignment upto PAGE_SIZE but kmalloc()
        doesn't.  For example, workqueue makes use of larger alignments for
        cpu_workqueues.
      
      Currently, users of percpu allocators need to handle UP differently,
      which is somewhat fragile and ugly.  Other than small amount of
      memory, there isn't much to lose by enabling percpu allocator on UP.
      It can simply use kernel memory based chunk allocation which was added
      for SMP archs w/o MMUs.
      
      This patch removes mm/percpu_up.c, builds mm/percpu.c on UP too and
      makes UP build use percpu-km.  As percpu addresses and kernel
      addresses are always identity mapped and static percpu variables don't
      need any special treatment, nothing is arch dependent and mm/percpu.c
      implements generic setup_per_cpu_areas() for UP.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      9b8327bb
  26. 08 Sep, 2010 1 commit
    • Tejun Heo's avatar
      percpu: use percpu allocator on UP too · bbddff05
      Tejun Heo authored
      
      
      On UP, percpu allocations were redirected to kmalloc.  This has the
      following problems.
      
      * For certain amount of allocations (determined by
        PERCPU_DYNAMIC_EARLY_SLOTS and PERCPU_DYNAMIC_EARLY_SIZE), percpu
        allocator can be used before the usual kernel memory allocator is
        brought online.  On SMP, this is used to initialize the kernel
        memory allocator.
      
      * percpu allocator honors alignment upto PAGE_SIZE but kmalloc()
        doesn't.  For example, workqueue makes use of larger alignments for
        cpu_workqueues.
      
      Currently, users of percpu allocators need to handle UP differently,
      which is somewhat fragile and ugly.  Other than small amount of
      memory, there isn't much to lose by enabling percpu allocator on UP.
      It can simply use kernel memory based chunk allocation which was added
      for SMP archs w/o MMUs.
      
      This patch removes mm/percpu_up.c, builds mm/percpu.c on UP too and
      makes UP build use percpu-km.  As percpu addresses and kernel
      addresses are always identity mapped and static percpu variables don't
      need any special treatment, nothing is arch dependent and mm/percpu.c
      implements generic setup_per_cpu_areas() for UP.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Acked-by: default avatarPekka Enberg <penberg@cs.helsinki.fi>
      bbddff05
  27. 14 Jul, 2010 1 commit
  28. 25 May, 2010 1 commit
    • Mel Gorman's avatar
      mm: compaction: memory compaction core · 748446bb
      Mel Gorman authored
      
      
      This patch is the core of a mechanism which compacts memory in a zone by
      relocating movable pages towards the end of the zone.
      
      A single compaction run involves a migration scanner and a free scanner.
      Both scanners operate on pageblock-sized areas in the zone.  The migration
      scanner starts at the bottom of the zone and searches for all movable
      pages within each area, isolating them onto a private list called
      migratelist.  The free scanner starts at the top of the zone and searches
      for suitable areas and consumes the free pages within making them
      available for the migration scanner.  The pages isolated for migration are
      then migrated to the newly isolated free pages.
      
      [aarcange@redhat.com: Fix unsafe optimisation]
      [mel@csn.ul.ie: do not schedule work on other CPUs for compaction]
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      748446bb
  29. 30 Mar, 2010 1 commit
    • Tejun Heo's avatar
      percpu: don't implicitly include slab.h from percpu.h · de380b55
      Tejun Heo authored
      
      
      percpu.h has always been including slab.h to get k[mz]alloc/free() for
      UP inline implementation.  percpu.h being used by very low level
      headers including module.h and sched.h, this meant that a lot files
      unintentionally got slab.h inclusion.
      
      Lee Schermerhorn was trying to make topology.h use percpu.h and got
      bitten by this implicit inclusion.  The right thing to do is break
      this ultimately unnecessary dependency.  The previous patch added
      explicit inclusion of either gfp.h or slab.h to the source files using
      them.  This patch updates percpu.h such that slab.h is no longer
      included from percpu.h.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      de380b55
  30. 16 Dec, 2009 1 commit