1. 13 Nov, 2013 2 commits
  2. 11 Sep, 2013 7 commits
    • Lisa Du's avatar
      mm: vmscan: fix do_try_to_free_pages() livelock · 6e543d57
      Lisa Du authored
      This patch is based on KOSAKI's work and I add a little more description,
      please refer https://lkml.org/lkml/2012/6/14/74.
      Currently, I found system can enter a state that there are lots of free
      pages in a zone but only order-0 and order-1 pages which means the zone is
      heavily fragmented, then high order allocation could make direct reclaim
      path's long stall(ex, 60 seconds) especially in no swap and no compaciton
      enviroment.  This problem happened on v3.4, but it seems issue still lives
      in current tree, the reason is do_try_to_free_pages enter live lock:
      kswapd will go to sleep if the zones have been fully scanned and are still
      not balanced.  As kswapd thinks there's little point trying all over again
      to avoid infinite loop.  Instead it changes order from high-order to
      0-order because kswapd think order-0 is the most important.  Look at
      73ce02e9 in detail.  If watermarks are ok, kswapd will go back to sleep
      and may leave zone->all_unreclaimable =3D 0.  It assume high-order users
      can still perform direct reclaim if they wish.
      Direct reclaim continue to reclaim for a high order which is not a
      COSTLY_ORDER without oom-killer until kswapd turn on
      zone->all_unreclaimble= .  This is because to avoid too early oom-kill.
      So it means direct_reclaim depends on kswapd to break this loop.
      In worst case, direct-reclaim may continue to page reclaim forever when
      kswapd sleeps forever until someone like watchdog detect and finally kill
      the process.  As described in:
      We can't turn on zone->all_unreclaimable from direct reclaim path because
      direct reclaim path don't take any lock and this way is racy.  Thus this
      patch removes zone->all_unreclaimable field completely and recalculates
      zone reclaimable state every time.
      Note: we can't take the idea that direct-reclaim see zone->pages_scanned
      directly and kswapd continue to use zone->all_unreclaimable.  Because, it
      is racy.  commit 929bea7c (vmscan: all_unreclaimable() use
      zone->all_unreclaimable as a name) describes the detail.
      [akpm@linux-foundation.org: uninline zone_reclaimable_pages() and zone_reclaimable()]
      Cc: Aaditya Kumar <aaditya.kumar.30@gmail.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Nick Piggin <npiggin@gmail.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Bob Liu <lliubbo@gmail.com>
      Cc: Neil Zhang <zhangwm@marvell.com>
      Cc: Russell King - ARM Linux <linux@arm.linux.org.uk>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.cz>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarLisa Du <cldu@marvell.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Christoph Lameter's avatar
      vmstat: use this_cpu() to avoid irqon/off sequence in refresh_cpu_vm_stats · fbc2edb0
      Christoph Lameter authored
      Disabling interrupts repeatedly can be avoided in the inner loop if we use
      a this_cpu operation.
      Signed-off-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      CC: Tejun Heo <tj@kernel.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Christoph Lameter's avatar
      vmstat: create fold_diff · 4edb0748
      Christoph Lameter authored
      Both functions that update global counters use the same mechanism.
      Create a function that contains the common code.
      Signed-off-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      CC: Tejun Heo <tj@kernel.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Christoph Lameter's avatar
      vmstat: create separate function to fold per cpu diffs into local counters · 2bb921e5
      Christoph Lameter authored
      The main idea behind this patchset is to reduce the vmstat update overhead
      by avoiding interrupt enable/disable and the use of per cpu atomics.
      This patch (of 3):
      It is better to have a separate folding function because
      refresh_cpu_vm_stats() also does other things like expire pages in the
      page allocator caches.
      If we have a separate function then refresh_cpu_vm_stats() is only called
      from the local cpu which allows additional optimizations.
      The folding function is only called when a cpu is being downed and
      therefore no other processor will be accessing the counters.  Also
      simplifies synchronization.
      [akpm@linux-foundation.org: fix UP build]
      Signed-off-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      CC: Tejun Heo <tj@kernel.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Johannes Weiner's avatar
      mm: page_alloc: fair zone allocator policy · 81c0a2bb
      Johannes Weiner authored
      Each zone that holds userspace pages of one workload must be aged at a
      speed proportional to the zone size.  Otherwise, the time an individual
      page gets to stay in memory depends on the zone it happened to be
      allocated in.  Asymmetry in the zone aging creates rather unpredictable
      aging behavior and results in the wrong pages being reclaimed, activated
      But exactly this happens right now because of the way the page allocator
      and kswapd interact.  The page allocator uses per-node lists of all zones
      in the system, ordered by preference, when allocating a new page.  When
      the first iteration does not yield any results, kswapd is woken up and the
      allocator retries.  Due to the way kswapd reclaims zones below the high
      watermark while a zone can be allocated from when it is above the low
      watermark, the allocator may keep kswapd running while kswapd reclaim
      ensures that the page allocator can keep allocating from the first zone in
      the zonelist for extended periods of time.  Meanwhile the other zones
      rarely see new allocations and thus get aged much slower in comparison.
      The result is that the occasional page placed in lower zones gets
      relatively more time in memory, even gets promoted to the active list
      after its peers have long been evicted.  Meanwhile, the bulk of the
      working set may be thrashing on the preferred zone even though there may
      be significant amounts of memory available in the lower zones.
      Even the most basic test -- repeatedly reading a file slightly bigger than
      memory -- shows how broken the zone aging is.  In this scenario, no single
      page should be able stay in memory long enough to get referenced twice and
      activated, but activation happens in spades:
        $ grep active_file /proc/zoneinfo
            nr_inactive_file 0
            nr_active_file 0
            nr_inactive_file 0
            nr_active_file 8
            nr_inactive_file 1582
            nr_active_file 11994
        $ cat data data data data >/dev/null
        $ grep active_file /proc/zoneinfo
            nr_inactive_file 0
            nr_active_file 70
            nr_inactive_file 258753
            nr_active_file 443214
            nr_inactive_file 149793
            nr_active_file 12021
      Fix this with a very simple round robin allocator.  Each zone is allowed a
      batch of allocations that is proportional to the zone's size, after which
      it is treated as full.  The batch counters are reset when all zones have
      been tried and the allocator enters the slowpath and kicks off kswapd
      reclaim.  Allocation and reclaim is now fairly spread out to all
      available/allowable zones:
        $ grep active_file /proc/zoneinfo
            nr_inactive_file 0
            nr_active_file 0
            nr_inactive_file 174
            nr_active_file 4865
            nr_inactive_file 53
            nr_active_file 860
        $ cat data data data data >/dev/null
        $ grep active_file /proc/zoneinfo
            nr_inactive_file 0
            nr_active_file 0
            nr_inactive_file 666622
            nr_active_file 4988
            nr_inactive_file 190969
            nr_active_file 937
      When zone_reclaim_mode is enabled, allocations will now spread out to all
      zones on the local node, not just the first preferred zone (which on a 4G
      node might be a tiny Normal zone).
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Paul Bolle <paul.bollee@gmail.com>
      Cc: Zlatko Calusic <zcalusic@bitsync.net>
      Tested-by: default avatarKevin Hilman <khilman@linaro.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Dave Hansen's avatar
      mm: vmstats: track TLB flush stats on UP too · 6df46865
      Dave Hansen authored
      The previous patch doing vmstats for TLB flushes ("mm: vmstats: tlb flush
      counters") effectively missed UP since arch/x86/mm/tlb.c is only compiled
      for SMP.
      UP systems do not do remote TLB flushes, so compile those counters out on
      arch/x86/kernel/cpu/mtrr/generic.c calls __flush_tlb() directly.  This is
      probably an optimization since both the mtrr code and __flush_tlb() write
      cr4.  It would probably be safe to make that a flush_tlb_all() (and then
      get these statistics), but the mtrr code is ancient and I'm hesitant to
      touch it other than to just stick in the counters.
      [akpm@linux-foundation.org: tweak comments]
      Signed-off-by: default avatarDave Hansen <dave.hansen@linux.intel.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Dave Hansen's avatar
      mm: vmstats: tlb flush counters · 9824cf97
      Dave Hansen authored
      I was investigating some TLB flush scaling issues and realized that we do
      not have any good methods for figuring out how many TLB flushes we are
      It would be nice to be able to do these in generic code, but the
      arch-independent calls don't explicitly specify whether we actually need
      to do remote flushes or not.  In the end, we really need to know if we
      actually _did_ global vs.  local invalidations, so that leaves us with few
      options other than to muck with the counters from arch-specific code.
      Signed-off-by: default avatarDave Hansen <dave.hansen@linux.intel.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  3. 14 Jul, 2013 1 commit
    • Paul Gortmaker's avatar
      kernel: delete __cpuinit usage from all core kernel files · 0db0628d
      Paul Gortmaker authored
      The __cpuinit type of throwaway sections might have made sense
      some time ago when RAM was more constrained, but now the savings
      do not offset the cost and complications.  For example, the fix in
      commit 5e427ec2 ("x86: Fix bit corruption at CPU resume time")
      is a good example of the nasty type of bugs that can be created
      with improper use of the various __init prefixes.
      After a discussion on LKML[1] it was decided that cpuinit should go
      the way of devinit and be phased out.  Once all the users are gone,
      we can then finally remove the macros themselves from linux/init.h.
      This removes all the uses of the __cpuinit macros from C files in
      the core kernel directories (kernel, init, lib, mm, and include)
      that don't really have a specific maintainer.
      [1] https://lkml.org/lkml/2013/5/20/589Signed-off-by: default avatarPaul Gortmaker <paul.gortmaker@windriver.com>
  4. 29 Apr, 2013 2 commits
  5. 24 Feb, 2013 4 commits
  6. 13 Dec, 2012 3 commits
    • Jiang Liu's avatar
      mm: introduce new field "managed_pages" to struct zone · 9feedc9d
      Jiang Liu authored
      Currently a zone's present_pages is calcuated as below, which is
      inaccurate and may cause trouble to memory hotplug.
      	spanned_pages - absent_pages - memmap_pages - dma_reserve.
      During fixing bugs caused by inaccurate zone->present_pages, we found
      zone->present_pages has been abused.  The field zone->present_pages may
      have different meanings in different contexts:
      1) pages existing in a zone.
      2) pages managed by the buddy system.
      For more discussions about the issue, please refer to:
      This patchset tries to introduce a new field named "managed_pages" to
      struct zone, which counts "pages managed by the buddy system".  And revert
      zone->present_pages to count "physical pages existing in a zone", which
      also keep in consistence with pgdat->node_present_pages.
      We will set an initial value for zone->managed_pages in function
      free_area_init_core() and will adjust it later if the initial value is
      For DMA/normal zones, the initial value is set to:
      	(spanned_pages - absent_pages - memmap_pages - dma_reserve)
      Later zone->managed_pages will be adjusted to the accurate value when the
      bootmem allocator frees all free pages to the buddy system in function
      free_all_bootmem_node() and free_all_bootmem().
      The bootmem allocator doesn't touch highmem pages, so highmem zones'
      managed_pages is set to the accurate value "spanned_pages - absent_pages"
      in function free_area_init_core() and won't be updated anymore.
      This patch also adds a new field "managed_pages" to /proc/zoneinfo
      and sysrq showmem.
      [akpm@linux-foundation.org: small comment tweaks]
      Signed-off-by: default avatarJiang Liu <jiang.liu@huawei.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
      Tested-by: default avatarChris Clayton <chris2553@googlemail.com>
      Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Lai Jiangshan's avatar
      vmstat: use N_MEMORY instead N_HIGH_MEMORY · a47b53c5
      Lai Jiangshan authored
      N_HIGH_MEMORY stands for the nodes that has normal or high memory.
      N_MEMORY stands for the nodes that has any memory.
      The code here need to handle with the nodes which have memory, we should
      use N_MEMORY instead.
      Signed-off-by: default avatarLai Jiangshan <laijs@cn.fujitsu.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Signed-off-by: default avatarWen Congyang <wency@cn.fujitsu.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Lin Feng <linfeng@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Kirill A. Shutemov's avatar
      thp, vmstat: implement HZP_ALLOC and HZP_ALLOC_FAILED events · d8a8e1f0
      Kirill A. Shutemov authored
      hzp_alloc is incremented every time a huge zero page is successfully
      	allocated. It includes allocations which where dropped due
      	race with other allocation. Note, it doesn't count every map
      	of the huge zero page, only its allocation.
      hzp_alloc_failed is incremented if kernel fails to allocate huge zero
      	page and falls back to using small pages.
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  7. 11 Dec, 2012 3 commits
    • Mel Gorman's avatar
      mm: numa: Add pte updates, hinting and migration stats · 03c5a6e1
      Mel Gorman authored
      It is tricky to quantify the basic cost of automatic NUMA placement in a
      meaningful manner. This patch adds some vmstats that can be used as part
      of a basic costing model.
      u    = basic unit = sizeof(void *)
      Ca   = cost of struct page access = sizeof(struct page) / u
      Cpte = Cost PTE access = Ca
      Cupdate = Cost PTE update = (2 * Cpte) + (2 * Wlock)
      	where Cpte is incurred twice for a read and a write and Wlock
      	is a constant representing the cost of taking or releasing a
      Cnumahint = Cost of a minor page fault = some high constant e.g. 1000
      Cpagerw = Cost to read or write a full page = Ca + PAGE_SIZE/u
      Ci = Cost of page isolation = Ca + Wi
      	where Wi is a constant that should reflect the approximate cost
      	of the locking operation
      Cpagecopy = Cpagerw + (Cpagerw * Wnuma) + Ci + (Ci * Wnuma)
      	where Wnuma is the approximate NUMA factor. 1 is local. 1.2
      	would imply that remote accesses are 20% more expensive
      Balancing cost = Cpte * numa_pte_updates +
      		Cnumahint * numa_hint_faults +
      		Ci * numa_pages_migrated +
      		Cpagecopy * numa_pages_migrated
      Note that numa_pages_migrated is used as a measure of how many pages
      were isolated even though it would miss pages that failed to migrate. A
      vmstat counter could have been added for it but the isolation cost is
      pretty marginal in comparison to the overall cost so it seemed overkill.
      The ideal way to measure automatic placement benefit would be to count
      the number of remote accesses versus local accesses and do something like
      	benefit = (remote_accesses_before - remove_access_after) * Wnuma
      but the information is not readily available. As a workload converges, the
      expection would be that the number of remote numa hints would reduce to 0.
      	convergence = numa_hint_faults_local / numa_hint_faults
      		where this is measured for the last N number of
      		numa hints recorded. When the workload is fully
      		converged the value is 1.
      This can measure if the placement policy is converging and how fast it is
      doing it.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
    • Mel Gorman's avatar
      mm: compaction: Add scanned and isolated counters for compaction · 397487db
      Mel Gorman authored
      Compaction already has tracepoints to count scanned and isolated pages
      but it requires that ftrace be enabled and if that information has to be
      written to disk then it can be disruptive. This patch adds vmstat counters
      for compaction called compact_migrate_scanned, compact_free_scanned and
      With these counters, it is possible to define a basic cost model for
      compaction. This approximates of how much work compaction is doing and can
      be compared that with an oprofile showing TLB misses and see if the cost of
      compaction is being offset by THP for example. Minimally a compaction patch
      can be evaluated in terms of whether it increases or decreases cost. The
      basic cost model looks like this
      Fundamental unit u:	a word	sizeof(void *)
      Ca  = cost of struct page access = sizeof(struct page) / u
      Cmc = Cost migrate page copy = (Ca + PAGE_SIZE/u) * 2
      Cmf = Cost migrate failure   = Ca * 2
      Ci  = Cost page isolation    = (Ca + Wi)
      	where Wi is a constant that should reflect the approximate
      	cost of the locking operation.
      Csm = Cost migrate scanning = Ca
      Csf = Cost free    scanning = Ca
      Overall cost =	(Csm * compact_migrate_scanned) +
      	      	(Csf * compact_free_scanned)    +
      	      	(Ci  * compact_isolated)	+
      		(Cmc * pgmigrate_success)	+
      		(Cmf * pgmigrate_failed)
      Where the values are read from /proc/vmstat.
      This is very basic and ignores certain costs such as the allocation cost
      to do a migrate page copy but any improvement to the model would still
      use the same vmstat counters.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
    • Mel Gorman's avatar
      mm: compaction: Move migration fail/success stats to migrate.c · 5647bc29
      Mel Gorman authored
      The compact_pages_moved and compact_pagemigrate_failed events are
      convenient for determining if compaction is active and to what
      degree migration is succeeding but it's at the wrong level. Other
      users of migration may also want to know if migration is working
      properly and this will be particularly true for any automated
      NUMA migration. This patch moves the counters down to migration
      with the new events called pgmigrate_success and pgmigrate_fail.
      The compact_blocks_moved counter is removed because while it was
      useful for debugging initially, it's worthless now as no meaningful
      conclusions can be drawn from its value.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
  8. 09 Oct, 2012 4 commits
  9. 21 Aug, 2012 1 commit
  10. 01 Aug, 2012 1 commit
    • Mel Gorman's avatar
      mm: account for the number of times direct reclaimers get throttled · 68243e76
      Mel Gorman authored
      Under significant pressure when writing back to network-backed storage,
      direct reclaimers may get throttled.  This is expected to be a short-lived
      event and the processes get woken up again but processes do get stalled.
      This patch counts how many times such stalling occurs.  It's up to the
      administrator whether to reduce these stalls by increasing
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: David Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  11. 29 May, 2012 1 commit
  12. 21 May, 2012 1 commit
  13. 26 Apr, 2012 1 commit
  14. 13 Jan, 2012 1 commit
  15. 01 Nov, 2011 3 commits
    • Dimitri Sivanich's avatar
      mm/vmstat.c: cache align vm_stat · a1cb2c60
      Dimitri Sivanich authored
      Avoid false sharing of the vm_stat array.
      This was found to adversely affect tmpfs I/O performance.
      Tests run on a 640 cpu UV system.
      With 120 threads doing parallel writes, each to different tmpfs mounts:
      No patch:		~300 MB/sec
      With vm_stat alignment:	~430 MB/sec
      Signed-off-by: default avatarDimitri Sivanich <sivanich@sgi.com>
      Acked-by: default avatarChristoph Lameter <cl@gentwo.org>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Mel Gorman's avatar
      mm: vmscan: immediately reclaim end-of-LRU dirty pages when writeback completes · 49ea7eb6
      Mel Gorman authored
      When direct reclaim encounters a dirty page, it gets recycled around the
      LRU for another cycle.  This patch marks the page PageReclaim similar to
      deactivate_page() so that the page gets reclaimed almost immediately after
      the page gets cleaned.  This is to avoid reclaiming clean pages that are
      younger than a dirty page encountered at the end of the LRU that might
      have been something like a use-once page.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarJohannes Weiner <jweiner@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Mel Gorman's avatar
      mm: vmscan: do not writeback filesystem pages in direct reclaim · ee72886d
      Mel Gorman authored
      Testing from the XFS folk revealed that there is still too much I/O from
      the end of the LRU in kswapd.  Previously it was considered acceptable by
      VM people for a small number of pages to be written back from reclaim with
      testing generally showing about 0.3% of pages reclaimed were written back
      (higher if memory was low).  That writing back a small number of pages is
      ok has been heavily disputed for quite some time and Dave Chinner
      explained it well;
      	It doesn't have to be a very high number to be a problem. IO
      	is orders of magnitude slower than the CPU time it takes to
      	flush a page, so the cost of making a bad flush decision is
      	very high. And single page writeback from the LRU is almost
      	always a bad flush decision.
      To complicate matters, filesystems respond very differently to requests
      from reclaim according to Christoph Hellwig;
      	xfs tries to write it back if the requester is kswapd
      	ext4 ignores the request if it's a delayed allocation
      	btrfs ignores the request
      As a result, each filesystem has different performance characteristics
      when under memory pressure and there are many pages being dirtied.  In
      some cases, the request is ignored entirely so the VM cannot depend on the
      IO being dispatched.
      The objective of this series is to reduce writing of filesystem-backed
      pages from reclaim, play nicely with writeback that is already in progress
      and throttle reclaim appropriately when writeback pages are encountered.
      The assumption is that the flushers will always write pages faster than if
      reclaim issues the IO.
      A secondary goal is to avoid the problem whereby direct reclaim splices
      two potentially deep call stacks together.
      There is a potential new problem as reclaim has less control over how long
      before a page in a particularly zone or container is cleaned and direct
      reclaimers depend on kswapd or flusher threads to do the necessary work.
      However, as filesystems sometimes ignore direct reclaim requests already,
      it is not expected to be a serious issue.
      Patch 1 disables writeback of filesystem pages from direct reclaim
      	entirely. Anonymous pages are still written.
      Patch 2 removes dead code in lumpy reclaim as it is no longer able
      	to synchronously write pages. This hurts lumpy reclaim but
      	there is an expectation that compaction is used for hugepage
      	allocations these days and lumpy reclaim's days are numbered.
      Patches 3-4 add warnings to XFS and ext4 if called from
      	direct reclaim. With patch 1, this "never happens" and is
      	intended to catch regressions in this logic in the future.
      Patch 5 disables writeback of filesystem pages from kswapd unless
      	the priority is raised to the point where kswapd is considered
      	to be in trouble.
      Patch 6 throttles reclaimers if too many dirty pages are being
      	encountered and the zones or backing devices are congested.
      Patch 7 invalidates dirty pages found at the end of the LRU so they
      	are reclaimed quickly after being written back rather than
      	waiting for a reclaimer to find them
      I consider this series to be orthogonal to the writeback work but it is
      worth noting that the writeback work affects the viability of patch 8 in
      I tested this on ext4 and xfs using fs_mark, a simple writeback test based
      on dd and a micro benchmark that does a streaming write to a large mapping
      (exercises use-once LRU logic) followed by streaming writes to a mix of
      anonymous and file-backed mappings.  The command line for fs_mark when
      botted with 512M looked something like
      ./fs_mark -d  /tmp/fsmark-2676  -D  100  -N  150  -n  150  -L  25  -t  1  -S0  -s  10485760
      The number of files was adjusted depending on the amount of available
      memory so that the files created was about 3xRAM.  For multiple threads,
      the -d switch is specified multiple times.
      The test machine is x86-64 with an older generation of AMD processor with
      4 cores.  The underlying storage was 4 disks configured as RAID-0 as this
      was the best configuration of storage I had available.  Swap is on a
      separate disk.  Dirty ratio was tuned to 40% instead of the default of
      Testing was run with and without monitors to both verify that the patches
      were operating as expected and that any performance gain was real and not
      due to interference from monitors.
      Here is a summary of results based on testing XFS.
      512M1P-xfs           Files/s  mean                 32.69 ( 0.00%)     34.44 ( 5.08%)
      512M1P-xfs           Elapsed Time fsmark                    51.41     48.29
      512M1P-xfs           Elapsed Time simple-wb                114.09    108.61
      512M1P-xfs           Elapsed Time mmap-strm                113.46    109.34
      512M1P-xfs           Kswapd efficiency fsmark                 62%       63%
      512M1P-xfs           Kswapd efficiency simple-wb              56%       61%
      512M1P-xfs           Kswapd efficiency mmap-strm              44%       42%
      512M-xfs             Files/s  mean                 30.78 ( 0.00%)     35.94 (14.36%)
      512M-xfs             Elapsed Time fsmark                    56.08     48.90
      512M-xfs             Elapsed Time simple-wb                112.22     98.13
      512M-xfs             Elapsed Time mmap-strm                219.15    196.67
      512M-xfs             Kswapd efficiency fsmark                 54%       56%
      512M-xfs             Kswapd efficiency simple-wb              54%       55%
      512M-xfs             Kswapd efficiency mmap-strm              45%       44%
      512M-4X-xfs          Files/s  mean                 30.31 ( 0.00%)     33.33 ( 9.06%)
      512M-4X-xfs          Elapsed Time fsmark                    63.26     55.88
      512M-4X-xfs          Elapsed Time simple-wb                100.90     90.25
      512M-4X-xfs          Elapsed Time mmap-strm                261.73    255.38
      512M-4X-xfs          Kswapd efficiency fsmark                 49%       50%
      512M-4X-xfs          Kswapd efficiency simple-wb              54%       56%
      512M-4X-xfs          Kswapd efficiency mmap-strm              37%       36%
      512M-16X-xfs         Files/s  mean                 60.89 ( 0.00%)     65.22 ( 6.64%)
      512M-16X-xfs         Elapsed Time fsmark                    67.47     58.25
      512M-16X-xfs         Elapsed Time simple-wb                103.22     90.89
      512M-16X-xfs         Elapsed Time mmap-strm                237.09    198.82
      512M-16X-xfs         Kswapd efficiency fsmark                 45%       46%
      512M-16X-xfs         Kswapd efficiency simple-wb              53%       55%
      512M-16X-xfs         Kswapd efficiency mmap-strm              33%       33%
      Up until 512-4X, the FSmark improvements were statistically significant.
      For the 4X and 16X tests the results were within standard deviations but
      just barely.  The time to completion for all tests is improved which is an
      important result.  In general, kswapd efficiency is not affected by
      skipping dirty pages.
      1024M1P-xfs          Files/s  mean                 39.09 ( 0.00%)     41.15 ( 5.01%)
      1024M1P-xfs          Elapsed Time fsmark                    84.14     80.41
      1024M1P-xfs          Elapsed Time simple-wb                210.77    184.78
      1024M1P-xfs          Elapsed Time mmap-strm                162.00    160.34
      1024M1P-xfs          Kswapd efficiency fsmark                 69%       75%
      1024M1P-xfs          Kswapd efficiency simple-wb              71%       77%
      1024M1P-xfs          Kswapd efficiency mmap-strm              43%       44%
      1024M-xfs            Files/s  mean                 35.45 ( 0.00%)     37.00 ( 4.19%)
      1024M-xfs            Elapsed Time fsmark                    94.59     91.00
      1024M-xfs            Elapsed Time simple-wb                229.84    195.08
      1024M-xfs            Elapsed Time mmap-strm                405.38    440.29
      1024M-xfs            Kswapd efficiency fsmark                 79%       71%
      1024M-xfs            Kswapd efficiency simple-wb              74%       74%
      1024M-xfs            Kswapd efficiency mmap-strm              39%       42%
      1024M-4X-xfs         Files/s  mean                 32.63 ( 0.00%)     35.05 ( 6.90%)
      1024M-4X-xfs         Elapsed Time fsmark                   103.33     97.74
      1024M-4X-xfs         Elapsed Time simple-wb                204.48    178.57
      1024M-4X-xfs         Elapsed Time mmap-strm                528.38    511.88
      1024M-4X-xfs         Kswapd efficiency fsmark                 81%       70%
      1024M-4X-xfs         Kswapd efficiency simple-wb              73%       72%
      1024M-4X-xfs         Kswapd efficiency mmap-strm              39%       38%
      1024M-16X-xfs        Files/s  mean                 42.65 ( 0.00%)     42.97 ( 0.74%)
      1024M-16X-xfs        Elapsed Time fsmark                   103.11     99.11
      1024M-16X-xfs        Elapsed Time simple-wb                200.83    178.24
      1024M-16X-xfs        Elapsed Time mmap-strm                397.35    459.82
      1024M-16X-xfs        Kswapd efficiency fsmark                 84%       69%
      1024M-16X-xfs        Kswapd efficiency simple-wb              74%       73%
      1024M-16X-xfs        Kswapd efficiency mmap-strm              39%       40%
      All FSMark tests up to 16X had statistically significant improvements.
      For the most part, tests are completing faster with the exception of the
      streaming writes to a mixture of anonymous and file-backed mappings which
      were slower in two cases
      In the cases where the mmap-strm tests were slower, there was more
      swapping due to dirty pages being skipped.  The number of additional pages
      swapped is almost identical to the fewer number of pages written from
      reclaim.  In other words, roughly the same number of pages were reclaimed
      but swapping was slower.  As the test is a bit unrealistic and stresses
      memory heavily, the small shift is acceptable.
      4608M1P-xfs          Files/s  mean                 29.75 ( 0.00%)     30.96 ( 3.91%)
      4608M1P-xfs          Elapsed Time fsmark                   512.01    492.15
      4608M1P-xfs          Elapsed Time simple-wb                618.18    566.24
      4608M1P-xfs          Elapsed Time mmap-strm                488.05    465.07
      4608M1P-xfs          Kswapd efficiency fsmark                 93%       86%
      4608M1P-xfs          Kswapd efficiency simple-wb              88%       84%
      4608M1P-xfs          Kswapd efficiency mmap-strm              46%       45%
      4608M-xfs            Files/s  mean                 27.60 ( 0.00%)     28.85 ( 4.33%)
      4608M-xfs            Elapsed Time fsmark                   555.96    532.34
      4608M-xfs            Elapsed Time simple-wb                659.72    571.85
      4608M-xfs            Elapsed Time mmap-strm               1082.57   1146.38
      4608M-xfs            Kswapd efficiency fsmark                 89%       91%
      4608M-xfs            Kswapd efficiency simple-wb              88%       82%
      4608M-xfs            Kswapd efficiency mmap-strm              48%       46%
      4608M-4X-xfs         Files/s  mean                 26.00 ( 0.00%)     27.47 ( 5.35%)
      4608M-4X-xfs         Elapsed Time fsmark                   592.91    564.00
      4608M-4X-xfs         Elapsed Time simple-wb                616.65    575.07
      4608M-4X-xfs         Elapsed Time mmap-strm               1773.02   1631.53
      4608M-4X-xfs         Kswapd efficiency fsmark                 90%       94%
      4608M-4X-xfs         Kswapd efficiency simple-wb              87%       82%
      4608M-4X-xfs         Kswapd efficiency mmap-strm              43%       43%
      4608M-16X-xfs        Files/s  mean                 26.07 ( 0.00%)     26.42 ( 1.32%)
      4608M-16X-xfs        Elapsed Time fsmark                   602.69    585.78
      4608M-16X-xfs        Elapsed Time simple-wb                606.60    573.81
      4608M-16X-xfs        Elapsed Time mmap-strm               1549.75   1441.86
      4608M-16X-xfs        Kswapd efficiency fsmark                 98%       98%
      4608M-16X-xfs        Kswapd efficiency simple-wb              88%       82%
      4608M-16X-xfs        Kswapd efficiency mmap-strm              44%       42%
      Unlike the other tests, the fsmark results are not statistically
      significant but the min and max times are both improved and for the most
      part, tests completed faster.
      There are other indications that this is an improvement as well.  For
      example, in the vast majority of cases, there were fewer pages scanned by
      direct reclaim implying in many cases that stalls due to direct reclaim
      are reduced.  KSwapd is scanning more due to skipping dirty pages which is
      unfortunate but the CPU usage is still acceptable
      In an earlier set of tests, I used blktrace and in almost all cases
      throughput throughout the entire test was higher.  However, I ended up
      discarding those results as recording blktrace data was too heavy for my
      On a laptop, I plugged in a USB stick and ran a similar tests of tests
      using it as backing storage.  A desktop environment was running and for
      the entire duration of the tests, firefox and gnome terminal were
      launching and exiting to vaguely simulate a user.
      1024M-xfs            Files/s  mean               0.41 ( 0.00%)        0.44 ( 6.82%)
      1024M-xfs            Elapsed Time fsmark               2053.52   1641.03
      1024M-xfs            Elapsed Time simple-wb            1229.53    768.05
      1024M-xfs            Elapsed Time mmap-strm            4126.44   4597.03
      1024M-xfs            Kswapd efficiency fsmark              84%       85%
      1024M-xfs            Kswapd efficiency simple-wb           92%       81%
      1024M-xfs            Kswapd efficiency mmap-strm           60%       51%
      1024M-xfs            Avg wait ms fsmark                5404.53     4473.87
      1024M-xfs            Avg wait ms simple-wb             2541.35     1453.54
      1024M-xfs            Avg wait ms mmap-strm             3400.25     3852.53
      The mmap-strm results were hurt because firefox launching had a tendency
      to push the test out of memory.  On the postive side, firefox launched
      marginally faster with the patches applied.  Time to completion for many
      tests was faster but more importantly - the "Avg wait" time as measured by
      iostat was far lower implying the system would be more responsive.  It was
      also the case that "Avg wait ms" on the root filesystem was lower.  I
      tested it manually and while the system felt slightly more responsive
      while copying data to a USB stick, it was marginal enough that it could be
      my imagination.
      This patch: do not writeback filesystem pages in direct reclaim.
      When kswapd is failing to keep zones above the min watermark, a process
      will enter direct reclaim in the same manner kswapd does.  If a dirty page
      is encountered during the scan, this page is written to backing storage
      using mapping->writepage.
      This causes two problems.  First, it can result in very deep call stacks,
      particularly if the target storage or filesystem are complex.  Some
      filesystems ignore write requests from direct reclaim as a result.  The
      second is that a single-page flush is inefficient in terms of IO.  While
      there is an expectation that the elevator will merge requests, this does
      not always happen.  Quoting Christoph Hellwig;
      	The elevator has a relatively small window it can operate on,
      	and can never fix up a bad large scale writeback pattern.
      This patch prevents direct reclaim writing back filesystem pages by
      checking if current is kswapd.  Anonymous pages are still written to swap
      as there is not the equivalent of a flusher thread for anonymous pages.
      If the dirty pages cannot be written back, they are placed back on the LRU
      lists.  There is now a direct dependency on dirty page balancing to
      prevent too many pages in the system being dirtied which would prevent
      reclaim making forward progress.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  16. 15 Sep, 2011 1 commit
  17. 25 May, 2011 2 commits
  18. 14 Apr, 2011 2 commits