1. 17 Mar, 2016 2 commits
    • Kirill A. Shutemov's avatar
      thp, vmstats: count deferred split events · f9719a03
      Kirill A. Shutemov authored
      
      
      Count how many times we put a THP in split queue.  Currently, it happens
      on partial unmap of a THP.
      
      Rapidly growing value can indicate that an application behaves
      unfriendly wrt THP: often fault in huge page and then unmap part of it.
      This leads to unnecessary memory fragmentation and the application may
      require tuning.
      
      The event also can help with debugging kernel [mis-]behaviour.
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f9719a03
    • Vlastimil Babka's avatar
      mm, compaction: introduce kcompactd · 698b1b30
      Vlastimil Babka authored
      
      
      Memory compaction can be currently performed in several contexts:
      
       - kswapd balancing a zone after a high-order allocation failure
       - direct compaction to satisfy a high-order allocation, including THP
         page fault attemps
       - khugepaged trying to collapse a hugepage
       - manually from /proc
      
      The purpose of compaction is two-fold.  The obvious purpose is to
      satisfy a (pending or future) high-order allocation, and is easy to
      evaluate.  The other purpose is to keep overal memory fragmentation low
      and help the anti-fragmentation mechanism.  The success wrt the latter
      purpose is more
      
      The current situation wrt the purposes has a few drawbacks:
      
       - compaction is invoked only when a high-order page or hugepage is not
         available (or manually).  This might be too late for the purposes of
         keeping memory fragmentation low.
       - direct compaction increases latency of allocations.  Again, it would
         be better if compaction was performed asynchronously to keep
         fragmentation low, before the allocation itself comes.
       - (a special case of the previous) the cost of compaction during THP
         page faults can easily offset the benefits of THP.
       - kswapd compaction appears to be complex, fragile and not working in
         some scenarios.  It could also end up compacting for a high-order
         allocation request when it should be reclaiming memory for a later
         order-0 request.
      
      To improve the situation, we should be able to benefit from an
      equivalent of kswapd, but for compaction - i.e. a background thread
      which responds to fragmentation and the need for high-order allocations
      (including hugepages) somewhat proactively.
      
      One possibility is to extend the responsibilities of kswapd, which could
      however complicate its design too much.  It should be better to let
      kswapd handle reclaim, as order-0 allocations are often more critical
      than high-order ones.
      
      Another possibility is to extend khugepaged, but this kthread is a
      single instance and tied to THP configs.
      
      This patch goes with the option of a new set of per-node kthreads called
      kcompactd, and lays the foundations, without introducing any new
      tunables.  The lifecycle mimics kswapd kthreads, including the memory
      hotplug hooks.
      
      For compaction, kcompactd uses the standard compaction_suitable() and
      ompact_finished() criteria and the deferred compaction functionality.
      Unlike direct compaction, it uses only sync compaction, as there's no
      allocation latency to minimize.
      
      This patch doesn't yet add a call to wakeup_kcompactd.  The kswapd
      compact/reclaim loop for high-order pages will be replaced by waking up
      kcompactd in the next patch with the description of what's wrong with
      the old approach.
      
      Waking up of the kcompactd threads is also tied to kswapd activity and
      follows these rules:
       - we don't want to affect any fastpaths, so wake up kcompactd only from
         the slowpath, as it's done for kswapd
       - if kswapd is doing reclaim, it's more important than compaction, so
         don't invoke kcompactd until kswapd goes to sleep
       - the target order used for kswapd is passed to kcompactd
      
      Future possible future uses for kcompactd include the ability to wake up
      kcompactd on demand in special situations, such as when hugepages are
      not available (currently not done due to __GFP_NO_KSWAPD) or when a
      fragmentation event (i.e.  __rmqueue_fallback()) occurs.  It's also
      possible to perform periodic compaction with kcompactd.
      
      [arnd@arndb.de: fix build errors with kcompactd]
      [paul.gortmaker@windriver.com: don't use modular references for non modular code]
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarPaul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      698b1b30
  2. 16 Jan, 2016 2 commits
    • Minchan Kim's avatar
      mm: support madvise(MADV_FREE) · 854e9ed0
      Minchan Kim authored
      Linux doesn't have an ability to free pages lazy while other OS already
      have been supported that named by madvise(MADV_FREE).
      
      The gain is clear that kernel can discard freed pages rather than
      swapping out or OOM if memory pressure happens.
      
      Without memory pressure, freed pages would be reused by userspace
      without another additional overhead(ex, page fault + allocation +
      zeroing).
      
      Jason Evans said:
      
      : Facebook has been using MAP_UNINITIALIZED
      : (https://lkml.org/lkml/2012/1/18/308
      
      ) in some of its applications for
      : several years, but there are operational costs to maintaining this
      : out-of-tree in our kernel and in jemalloc, and we are anxious to retire it
      : in favor of MADV_FREE.  When we first enabled MAP_UNINITIALIZED it
      : increased throughput for much of our workload by ~5%, and although the
      : benefit has decreased using newer hardware and kernels, there is still
      : enough benefit that we cannot reasonably retire it without a replacement.
      :
      : Aside from Facebook operations, there are numerous broadly used
      : applications that would benefit from MADV_FREE.  The ones that immediately
      : come to mind are redis, varnish, and MariaDB.  I don't have much insight
      : into Android internals and development process, but I would hope to see
      : MADV_FREE support eventually end up there as well to benefit applications
      : linked with the integrated jemalloc.
      :
      : jemalloc will use MADV_FREE once it becomes available in the Linux kernel.
      : In fact, jemalloc already uses MADV_FREE or equivalent everywhere it's
      : available: *BSD, OS X, Windows, and Solaris -- every platform except Linux
      : (and AIX, but I'm not sure it even compiles on AIX).  The lack of
      : MADV_FREE on Linux forced me down a long series of increasingly
      : sophisticated heuristics for madvise() volume reduction, and even so this
      : remains a common performance issue for people using jemalloc on Linux.
      : Please integrate MADV_FREE; many people will benefit substantially.
      
      How it works:
      
      When madvise syscall is called, VM clears dirty bit of ptes of the
      range.  If memory pressure happens, VM checks dirty bit of page table
      and if it found still "clean", it means it's a "lazyfree pages" so VM
      could discard the page instead of swapping out.  Once there was store
      operation for the page before VM peek a page to reclaim, dirty bit is
      set so VM can swap out the page instead of discarding.
      
      One thing we should notice is that basically, MADV_FREE relies on dirty
      bit in page table entry to decide whether VM allows to discard the page
      or not.  IOW, if page table entry includes marked dirty bit, VM
      shouldn't discard the page.
      
      However, as a example, if swap-in by read fault happens, page table
      entry doesn't have dirty bit so MADV_FREE could discard the page
      wrongly.
      
      For avoiding the problem, MADV_FREE did more checks with PageDirty and
      PageSwapCache.  It worked out because swapped-in page lives on swap
      cache and since it is evicted from the swap cache, the page has PG_dirty
      flag.  So both page flags check effectively prevent wrong discarding by
      MADV_FREE.
      
      However, a problem in above logic is that swapped-in page has PG_dirty
      still after they are removed from swap cache so VM cannot consider the
      page as freeable any more even if madvise_free is called in future.
      
      Look at below example for detail.
      
          ptr = malloc();
          memset(ptr);
          ..
          ..
          .. heavy memory pressure so all of pages are swapped out
          ..
          ..
          var = *ptr; -> a page swapped-in and could be removed from
                         swapcache. Then, page table doesn't mark
                         dirty bit and page descriptor includes PG_dirty
          ..
          ..
          madvise_free(ptr); -> It doesn't clear PG_dirty of the page.
          ..
          ..
          ..
          .. heavy memory pressure again.
          .. In this time, VM cannot discard the page because the page
          .. has *PG_dirty*
      
      To solve the problem, this patch clears PG_dirty if only the page is
      owned exclusively by current process when madvise is called because
      PG_dirty represents ptes's dirtiness in several processes so we could
      clear it only if we own it exclusively.
      
      Firstly, heavy users would be general allocators(ex, jemalloc, tcmalloc
      and hope glibc supports it) and jemalloc/tcmalloc already have supported
      the feature for other OS(ex, FreeBSD)
      
        barrios@blaptop:~/benchmark/ebizzy$ lscpu
        Architecture:          x86_64
        CPU op-mode(s):        32-bit, 64-bit
        Byte Order:            Little Endian
        CPU(s):                12
        On-line CPU(s) list:   0-11
        Thread(s) per core:    1
        Core(s) per socket:    1
        Socket(s):             12
        NUMA node(s):          1
        Vendor ID:             GenuineIntel
        CPU family:            6
        Model:                 2
        Stepping:              3
        CPU MHz:               3200.185
        BogoMIPS:              6400.53
        Virtualization:        VT-x
        Hypervisor vendor:     KVM
        Virtualization type:   full
        L1d cache:             32K
        L1i cache:             32K
        L2 cache:              4096K
        NUMA node0 CPU(s):     0-11
        ebizzy benchmark(./ebizzy -S 10 -n 512)
      
        Higher avg is better.
      
         vanilla-jemalloc             MADV_free-jemalloc
      
        1 thread
        records: 10                   records: 10
        avg:   2961.90                avg:  12069.70
        std:     71.96(2.43%)         std:    186.68(1.55%)
        max:   3070.00                max:  12385.00
        min:   2796.00                min:  11746.00
      
        2 thread
        records: 10                   records: 10
        avg:   5020.00                avg:  17827.00
        std:    264.87(5.28%)         std:    358.52(2.01%)
        max:   5244.00                max:  18760.00
        min:   4251.00                min:  17382.00
      
        4 thread
        records: 10                   records: 10
        avg:   8988.80                avg:  27930.80
        std:   1175.33(13.08%)        std:   3317.33(11.88%)
        max:   9508.00                max:  30879.00
        min:   5477.00                min:  21024.00
      
        8 thread
        records: 10                   records: 10
        avg:  13036.50                avg:  33739.40
        std:    170.67(1.31%)         std:   5146.22(15.25%)
        max:  13371.00                max:  40572.00
        min:  12785.00                min:  24088.00
      
        16 thread
        records: 10                   records: 10
        avg:  11092.40                avg:  31424.20
        std:    710.60(6.41%)         std:   3763.89(11.98%)
        max:  12446.00                max:  36635.00
        min:   9949.00                min:  25669.00
      
        32 thread
        records: 10                   records: 10
        avg:  11067.00                avg:  34495.80
        std:    971.06(8.77%)         std:   2721.36(7.89%)
        max:  12010.00                max:  38598.00
        min:   9002.00                min:  30636.00
      
      In summary, MADV_FREE is about much faster than MADV_DONTNEED.
      
      This patch (of 12):
      
      Add core MADV_FREE implementation.
      
      [akpm@linux-foundation.org: small cleanups]
      Signed-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Mika Penttil <mika.penttila@nextfour.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jason Evans <je@fb.com>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: <yalin.wang2010@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: "Shaohua Li" <shli@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Gang <gang.chen.5i5j@gmail.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      854e9ed0
    • Kirill A. Shutemov's avatar
      mm, vmstats: new THP splitting event · 122afea9
      Kirill A. Shutemov authored
      
      
      The patch replaces THP_SPLIT with tree events: THP_SPLIT_PAGE,
      THP_SPLIT_PAGE_FAILED and THP_SPLIT_PMD.  It reflects the fact that we
      are going to be able split PMD without the compound page and that
      split_huge_page() can fail.
      Signed-off-by: default avatarKirill A.  Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Tested-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Tested-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: default avatarJerome Marchand <jmarchan@redhat.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      122afea9
  3. 06 Nov, 2015 1 commit
  4. 13 Dec, 2014 1 commit
  5. 10 Oct, 2014 1 commit
  6. 04 Jun, 2014 1 commit
  7. 03 Apr, 2014 1 commit
  8. 25 Jan, 2014 1 commit
  9. 13 Nov, 2013 1 commit
    • Mel Gorman's avatar
      mm: numa: return the number of base pages altered by protection changes · 72403b4a
      Mel Gorman authored
      Commit 0255d491
      
       ("mm: Account for a THP NUMA hinting update as one
      PTE update") was added to account for the number of PTE updates when
      marking pages prot_numa.  task_numa_work was using the old return value
      to track how much address space had been updated.  Altering the return
      value causes the scanner to do more work than it is configured or
      documented to in a single unit of work.
      
      This patch reverts that commit and accounts for the number of THP
      updates separately in vmstat.  It is up to the administrator to
      interpret the pair of values correctly.  This is a straight-forward
      operation and likely to only be of interest when actively debugging NUMA
      balancing problems.
      
      The impact of this patch is that the NUMA PTE scanner will scan slower
      when THP is enabled and workloads may converge slower as a result.  On
      the flip size system CPU usage should be lower than recent tests
      reported.  This is an illustrative example of a short single JVM specjbb
      test
      
      specjbb
                             3.12.0                3.12.0
                            vanilla      acctupdates
      TPut 1      26143.00 (  0.00%)     25747.00 ( -1.51%)
      TPut 7     185257.00 (  0.00%)    183202.00 ( -1.11%)
      TPut 13    329760.00 (  0.00%)    346577.00 (  5.10%)
      TPut 19    442502.00 (  0.00%)    460146.00 (  3.99%)
      TPut 25    540634.00 (  0.00%)    549053.00 (  1.56%)
      TPut 31    512098.00 (  0.00%)    519611.00 (  1.47%)
      TPut 37    461276.00 (  0.00%)    474973.00 (  2.97%)
      TPut 43    403089.00 (  0.00%)    414172.00 (  2.75%)
      
                    3.12.0      3.12.0
                   vanillaacctupdates
      User         5169.64     5184.14
      System        100.45       80.02
      Elapsed       252.75      251.85
      
      Performance is similar but note the reduction in system CPU time.  While
      this showed a performance gain, it will not be universal but at least
      it'll be behaving as documented.  The vmstats are obviously different but
      here is an obvious interpretation of them from mmtests.
      
                                      3.12.0      3.12.0
                                     vanillaacctupdates
      NUMA page range updates        1408326    11043064
      NUMA huge PMD updates                0       21040
      NUMA PTE updates               1408326      291624
      
      "NUMA page range updates" == nr_pte_updates and is the value returned to
      the NUMA pte scanner.  NUMA huge PMD updates were the number of THP
      updates which in combination can be used to calculate how many ptes were
      updated from userspace.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Reported-by: default avatarAlex Thorlton <athorlton@sgi.com>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      72403b4a
  10. 11 Sep, 2013 2 commits
    • Dave Hansen's avatar
      mm: vmstats: track TLB flush stats on UP too · 6df46865
      Dave Hansen authored
      
      
      The previous patch doing vmstats for TLB flushes ("mm: vmstats: tlb flush
      counters") effectively missed UP since arch/x86/mm/tlb.c is only compiled
      for SMP.
      
      UP systems do not do remote TLB flushes, so compile those counters out on
      UP.
      
      arch/x86/kernel/cpu/mtrr/generic.c calls __flush_tlb() directly.  This is
      probably an optimization since both the mtrr code and __flush_tlb() write
      cr4.  It would probably be safe to make that a flush_tlb_all() (and then
      get these statistics), but the mtrr code is ancient and I'm hesitant to
      touch it other than to just stick in the counters.
      
      [akpm@linux-foundation.org: tweak comments]
      Signed-off-by: default avatarDave Hansen <dave.hansen@linux.intel.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6df46865
    • Dave Hansen's avatar
      mm: vmstats: tlb flush counters · 9824cf97
      Dave Hansen authored
      
      
      I was investigating some TLB flush scaling issues and realized that we do
      not have any good methods for figuring out how many TLB flushes we are
      doing.
      
      It would be nice to be able to do these in generic code, but the
      arch-independent calls don't explicitly specify whether we actually need
      to do remote flushes or not.  In the end, we really need to know if we
      actually _did_ global vs.  local invalidations, so that leaves us with few
      options other than to muck with the counters from arch-specific code.
      Signed-off-by: default avatarDave Hansen <dave.hansen@linux.intel.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9824cf97
  11. 24 Feb, 2013 1 commit
  12. 13 Dec, 2012 1 commit
  13. 11 Dec, 2012 3 commits
    • Mel Gorman's avatar
      mm: numa: Add pte updates, hinting and migration stats · 03c5a6e1
      Mel Gorman authored
      
      
      It is tricky to quantify the basic cost of automatic NUMA placement in a
      meaningful manner. This patch adds some vmstats that can be used as part
      of a basic costing model.
      
      u    = basic unit = sizeof(void *)
      Ca   = cost of struct page access = sizeof(struct page) / u
      Cpte = Cost PTE access = Ca
      Cupdate = Cost PTE update = (2 * Cpte) + (2 * Wlock)
      	where Cpte is incurred twice for a read and a write and Wlock
      	is a constant representing the cost of taking or releasing a
      	lock
      Cnumahint = Cost of a minor page fault = some high constant e.g. 1000
      Cpagerw = Cost to read or write a full page = Ca + PAGE_SIZE/u
      Ci = Cost of page isolation = Ca + Wi
      	where Wi is a constant that should reflect the approximate cost
      	of the locking operation
      Cpagecopy = Cpagerw + (Cpagerw * Wnuma) + Ci + (Ci * Wnuma)
      	where Wnuma is the approximate NUMA factor. 1 is local. 1.2
      	would imply that remote accesses are 20% more expensive
      
      Balancing cost = Cpte * numa_pte_updates +
      		Cnumahint * numa_hint_faults +
      		Ci * numa_pages_migrated +
      		Cpagecopy * numa_pages_migrated
      
      Note that numa_pages_migrated is used as a measure of how many pages
      were isolated even though it would miss pages that failed to migrate. A
      vmstat counter could have been added for it but the isolation cost is
      pretty marginal in comparison to the overall cost so it seemed overkill.
      
      The ideal way to measure automatic placement benefit would be to count
      the number of remote accesses versus local accesses and do something like
      
      	benefit = (remote_accesses_before - remove_access_after) * Wnuma
      
      but the information is not readily available. As a workload converges, the
      expection would be that the number of remote numa hints would reduce to 0.
      
      	convergence = numa_hint_faults_local / numa_hint_faults
      		where this is measured for the last N number of
      		numa hints recorded. When the workload is fully
      		converged the value is 1.
      
      This can measure if the placement policy is converging and how fast it is
      doing it.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      03c5a6e1
    • Mel Gorman's avatar
      mm: compaction: Add scanned and isolated counters for compaction · 397487db
      Mel Gorman authored
      
      
      Compaction already has tracepoints to count scanned and isolated pages
      but it requires that ftrace be enabled and if that information has to be
      written to disk then it can be disruptive. This patch adds vmstat counters
      for compaction called compact_migrate_scanned, compact_free_scanned and
      compact_isolated.
      
      With these counters, it is possible to define a basic cost model for
      compaction. This approximates of how much work compaction is doing and can
      be compared that with an oprofile showing TLB misses and see if the cost of
      compaction is being offset by THP for example. Minimally a compaction patch
      can be evaluated in terms of whether it increases or decreases cost. The
      basic cost model looks like this
      
      Fundamental unit u:	a word	sizeof(void *)
      
      Ca  = cost of struct page access = sizeof(struct page) / u
      
      Cmc = Cost migrate page copy = (Ca + PAGE_SIZE/u) * 2
      Cmf = Cost migrate failure   = Ca * 2
      Ci  = Cost page isolation    = (Ca + Wi)
      	where Wi is a constant that should reflect the approximate
      	cost of the locking operation.
      
      Csm = Cost migrate scanning = Ca
      Csf = Cost free    scanning = Ca
      
      Overall cost =	(Csm * compact_migrate_scanned) +
      	      	(Csf * compact_free_scanned)    +
      	      	(Ci  * compact_isolated)	+
      		(Cmc * pgmigrate_success)	+
      		(Cmf * pgmigrate_failed)
      
      Where the values are read from /proc/vmstat.
      
      This is very basic and ignores certain costs such as the allocation cost
      to do a migrate page copy but any improvement to the model would still
      use the same vmstat counters.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      397487db
    • Mel Gorman's avatar
      mm: compaction: Move migration fail/success stats to migrate.c · 5647bc29
      Mel Gorman authored
      
      
      The compact_pages_moved and compact_pagemigrate_failed events are
      convenient for determining if compaction is active and to what
      degree migration is succeeding but it's at the wrong level. Other
      users of migration may also want to know if migration is working
      properly and this will be particularly true for any automated
      NUMA migration. This patch moves the counters down to migration
      with the new events called pgmigrate_success and pgmigrate_fail.
      The compact_blocks_moved counter is removed because while it was
      useful for debugging initially, it's worthless now as no meaningful
      conclusions can be drawn from its value.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      5647bc29
  14. 09 Oct, 2012 2 commits
  15. 01 Aug, 2012 1 commit
    • Mel Gorman's avatar
      mm: account for the number of times direct reclaimers get throttled · 68243e76
      Mel Gorman authored
      
      
      Under significant pressure when writing back to network-backed storage,
      direct reclaimers may get throttled.  This is expected to be a short-lived
      event and the processes get woken up again but processes do get stalled.
      This patch counts how many times such stalling occurs.  It's up to the
      administrator whether to reduce these stalls by increasing
      min_free_kbytes.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: David Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      68243e76
  16. 26 Apr, 2012 1 commit
  17. 27 May, 2011 1 commit
    • Andrew Morton's avatar
      mm: move enum vm_event_item into a standalone header file · f042e707
      Andrew Morton authored
      
      
      enums are problematic because they cannot be forward-declared:
      
        akpm2:/home/akpm> cat t.c
      
        enum foo;
      
        static inline void bar(enum foo f)
        {
        }
        akpm2:/home/akpm> gcc -c t.c
        t.c:4: error: parameter 1 ('f') has incomplete type
      
      So move the enum's definition into a standalone header file which can be used
      wherever its definition is needed.
      
      Cc: Ying Han <yinghan@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f042e707