1. 13 Jan, 2012 40 commits
    • kdump: add udev events for memory online/offline · f5138e42
      Michael Holzheu authored

      Currently no udev events for memory hotplug "online" and "offline" are
      generated:
      
        # udevadm monitor
        # echo offline > /sys/devices/system/memory/memory4/state
        ==> No event
      
      When kdump is loaded, kexec detects the current memory configuration and
      stores it in the pre-allocated ELF core header.  Therefore, for kdump it
      is necessary to reload the kdump kernel with kexec when the memory
      configuration changes (e.g.  for online/offline hotplug memory).
      
      In order to do this automatically, udev rules should be used.  This kernel
      patch adds udev events for "online" and "offline".  Together with this
      kernel patch, the following udev rules for online/offline have to be added
      to "/etc/udev/rules.d/98-kexec.rules":
      
        SUBSYSTEM=="memory", ACTION=="online", PROGRAM="/etc/init.d/kdump restart"
        SUBSYSTEM=="memory", ACTION=="offline", PROGRAM="/etc/init.d/kdump restart"
      
      [sfr@canb.auug.org.au: fixups for class to subsystem conversion]
      Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Greg KH <greg@kroah.com>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • include/linux/crash_dump.h needs elf.h · 1f536b9e
      Fabio Estevam authored

      Building an ARM target we get the following warnings:
      
        CC      arch/arm/kernel/setup.o
        In file included from arch/arm/kernel/setup.c:39:
        arch/arm/include/asm/elf.h:102:1: warning: "vmcore_elf64_check_arch" redefined
        In file included from arch/arm/kernel/setup.c:24:
        include/linux/crash_dump.h:30:1: warning: this is the location of the previous definition
      
      Quoting Russell King:
      
      "linux/crash_dump.h makes no attempt to include asm/elf.h, but it depends
      on stuff in asm/elf.h to determine how stuff inside this file is defined
      at parse time.
      
      So, if asm/elf.h is included after linux/crash_dump.h or not at all, you
      get a different result from the situation where asm/elf.h is included
      before."
      
      So add elf.h header to crash_dump.h to avoid this problem.
      
      The original discussion about this can be found at:
      http://www.spinics.net/lists/arm-kernel/msg154113.html
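
      The ordering problem Russell King describes can be reproduced in a
      single file.  This is only a minimal sketch, not the kernel headers:
      the two commented regions stand in for asm/elf.h and
      linux/crash_dump.h, and ARCH_EM and check_machine() are hypothetical
      names introduced for illustration.

```c
#include <assert.h>

/* --- stands in for asm/elf.h: the architecture's own definition --- */
#define ARCH_EM 40 /* hypothetical machine id used only in this sketch */
#define vmcore_elf64_check_arch(machine) ((machine) == ARCH_EM)

/* --- stands in for linux/crash_dump.h: generic fallback --- */
/*
 * The fallback applies only if the arch header has not defined the macro.
 * If this region were parsed first, the fallback would be defined and the
 * later arch definition would trigger the "redefined" warning.
 */
#ifndef vmcore_elf64_check_arch
#define vmcore_elf64_check_arch(machine) 0
#endif

/* Returns 1 when the arch-specific check is in effect and matches. */
int check_machine(int machine)
{
	return vmcore_elf64_check_arch(machine);
}
```

      Because the arch region comes first here, the fallback is skipped,
      which is the ordering that including elf.h from crash_dump.h
      guarantees.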
      
      Signed-off-by: Fabio Estevam <fabio.estevam@freescale.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: <stable@vger.kernel.org> [3.2.1]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kdump: fix crash_kexec()/smp_send_stop() race in panic() · 93e13a36
      Michael Holzheu authored

      When two CPUs call panic at the same time there is a possible race
      condition that can stop kdump.  The first CPU calls crash_kexec() and the
      second CPU calls smp_send_stop() in panic() before crash_kexec() finished
      on the first CPU.  So the second CPU stops the first CPU and therefore
      kdump fails:
      
      1st CPU:
        panic()->crash_kexec()->mutex_trylock(&kexec_mutex)-> do kdump
      
      2nd CPU:
        panic()->crash_kexec()->kexec_mutex already held by 1st CPU
             ->smp_send_stop()-> stop 1st CPU (stop kdump)
      
      This patch fixes the problem by introducing a spinlock in panic that
      allows only one CPU to process crash_kexec() and the subsequent panic
      code.
      
      All other CPUs call the weak function panic_smp_self_stop() that stops the
      CPU itself.  This function can be overridden by architecture code.  For
      example "tile" can use their lower-power "nap" instruction for that.
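
      The "only one CPU proceeds" rule can be sketched in user space with a
      C11 atomic flag used as a try-lock.  This is an illustration under
      stated assumptions, not the kernel code: panic_enter(),
      panic_smp_self_stop_model() and stopped_cpus() are hypothetical names,
      and a real CPU would never return from its self-stop routine.

```c
#include <assert.h>
#include <stdatomic.h>

/* Stand-in for the panic lock: clear means no CPU has panicked yet. */
static atomic_flag panic_lock = ATOMIC_FLAG_INIT;

/* Counts callers that lost the race; a real CPU would halt or nap here. */
static atomic_int self_stopped;

/* Model of the weak panic_smp_self_stop(); an arch may override it. */
static void panic_smp_self_stop_model(void)
{
	atomic_fetch_add(&self_stopped, 1);
}

/*
 * Returns 1 if the caller is the first to panic and may continue to
 * crash_kexec(); returns 0 if another caller already owns the panic path.
 */
int panic_enter(void)
{
	if (atomic_flag_test_and_set(&panic_lock)) {
		panic_smp_self_stop_model();
		return 0;
	}
	return 1;
}

/* Number of callers that were turned away so far. */
int stopped_cpus(void)
{
	return atomic_load(&self_stopped);
}
```

      Since losers stop themselves instead of calling smp_send_stop(), the
      winner cannot be interrupted in the middle of crash_kexec().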

      Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
      Acked-by: Chris Metcalf <cmetcalf@tilera.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kdump: crashk_res init check for /sys/kernel/kexec_crash_size · bec013c4
      Michael Holzheu authored

      Currently it is possible to set the crash_size via the sysfs
      /sys/kernel/kexec_crash_size even if no crash kernel memory has been
      defined with the "crashkernel" parameter.  In this case "crashk_res" is
      not initialized and crashk_res.start = crashk_res.end = 0.  Unfortunately
      resource_size(&crashk_res) returns 1 in this case.  This breaks the s390
      implementation of crash_(un)map_reserved_pages().
      
      To fix the problem the correct "old_size" is now calculated in
      crash_shrink_memory().  "old_size" is set to "0" if crashk_res is not
      initialized.  With this change crash_shrink_memory() does nothing when
      "crashk_res" is not initialized.  It will return "0" for "echo 0 >
      /sys/kernel/kexec_crash_size" and -EINVAL for "echo [not zero] >
      /sys/kernel/kexec_crash_size".
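
      The arithmetic behind the bug and the fix can be sketched as follows.
      This is a user-space model under stated assumptions: struct
      resource_model, crash_old_size() and crash_shrink_check() are
      hypothetical names, and the return value 1 merely marks the point
      where the real code would go on to release memory.

```c
#include <assert.h>
#include <errno.h>

/* Minimal model of the kernel's struct resource (inclusive range). */
struct resource_model {
	unsigned long start;
	unsigned long end;
};

/* Same arithmetic as resource_size(): yields 1 for start = end = 0. */
unsigned long resource_size_model(const struct resource_model *res)
{
	return res->end - res->start + 1;
}

/* Treat an uninitialized resource (end == 0) as having size 0. */
unsigned long crash_old_size(const struct resource_model *res)
{
	return res->end == 0 ? 0 : res->end - res->start + 1;
}

/*
 * Returns 0 when nothing needs to change, -EINVAL when the requested size
 * exceeds the current size, and 1 when shrinking would proceed.
 */
int crash_shrink_check(const struct resource_model *res,
		       unsigned long new_size)
{
	unsigned long old_size = crash_old_size(res);

	if (new_size >= old_size)
		return new_size == old_size ? 0 : -EINVAL;
	return 1; /* the real code would release memory here */
}
```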
      
      In addition to that this patch also simplifies the "ret = -EINVAL" vs.
      "ret = 0" logic as suggested by Simon Horman.

      Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
      Reviewed-by: Dave Young <dyoung@redhat.com>
      Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>
      Reviewed-by: Simon Horman <horms@verge.net.au>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kdump: add missing RAM resource in crash_shrink_memory() · 6480e5a0
      Michael Holzheu authored

      When crashkernel memory is shrunk using /sys/kernel/kexec_crash_size,
      no RAM resource is currently created for the newly released memory.
      
      Example:
      
        $ cat /proc/iomem
        00000000-bfffffff : System RAM
          00000000-005b7ac3 : Kernel code
          005b7ac4-009743bf : Kernel data
          009bb000-00a85c33 : Kernel bss
        c0000000-cfffffff : Crash kernel
        d0000000-ffffffff : System RAM
      
        $ echo 0 > /sys/kernel/kexec_crash_size
        $ cat /proc/iomem
        00000000-bfffffff : System RAM
          00000000-005b7ac3 : Kernel code
          005b7ac4-009743bf : Kernel data
          009bb000-00a85c33 : Kernel bss
                                         <<-- System RAM is missing here
        d0000000-ffffffff : System RAM
      
      One result of this bug is that the memory chunk can never be set offline
      using memory hotplug.  With this patch I insert a new "System RAM"
      resource for the released memory.  The example above then looks like the
      following:
      
        $ echo 0 > /sys/kernel/kexec_crash_size
        $ cat /proc/iomem
        00000000-bfffffff : System RAM
          00000000-005b7ac3 : Kernel code
          005b7ac4-009743bf : Kernel data
          009bb000-00a85c33 : Kernel bss
        c0000000-cfffffff : System RAM   <<-- new resource
        d0000000-ffffffff : System RAM
      
      And now I can set chunk c0000000-cfffffff offline.

      Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kexec: remove KMSG_DUMP_KEXEC · a3dd3323
      WANG Cong authored

      KMSG_DUMP_KEXEC is useless because we already save kernel messages inside
      /proc/vmcore, and it is unsafe to allow modules to do other things in a
      crash dump scenario.
      
      [akpm@linux-foundation.org: fix powerpc build]
      Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
      Reported-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Jarod Wilson <jarod@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cpumask: update setup_node_to_cpumask_map() comments · 9512938b
      Wanlong Gao authored

      node_to_cpumask() has been replaced by cpumask_of_node(), and wholly
      removed since commit 29c337a0 ("cpumask: remove obsolete node_to_cpumask
      now everyone uses cpumask_of_node").
      
      So update the comments for setup_node_to_cpumask_map().

      Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
      Acked-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmalloc.c: eliminate extra loop in pcpu_get_vm_areas error path · f1db7afd
      Kautuk Consul authored

      If either the vas or the vms array is not successfully kzalloc'ed, the
      code jumps to the err_free label.
      
      The err_free label runs a loop to check and free each of the array members
      of the vas and vms arrays which is not required for this situation as none
      of the array members have been allocated till this point.
      
      Eliminate the extra loop we have to go through by introducing a new label
      err_free2 and then jumping to it.
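
      The two-label error path can be sketched in user space as follows.
      This is a model under stated assumptions, not the vmalloc code:
      get_areas_model(), its fail_at test hook and the loop_iters counter
      are hypothetical, and malloc()/calloc() stand in for the kernel
      allocators.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * fail_at is a test hook: 0 = succeed, 1 = array allocation fails,
 * 2 = allocation of the last element fails.
 * *loop_iters reports how many cleanup-loop iterations were executed.
 */
int get_areas_model(int nr, int fail_at, int *loop_iters)
{
	void **vas, **vms;
	int i;

	assert(nr > 0);
	*loop_iters = 0;

	vas = (fail_at == 1) ? NULL : calloc(nr, sizeof(*vas));
	vms = calloc(nr, sizeof(*vms));
	if (!vas || !vms)
		goto err_free2;	/* no element exists yet: skip the loop */

	for (i = 0; i < nr; i++) {
		vas[i] = (fail_at == 2 && i == nr - 1) ? NULL : malloc(16);
		vms[i] = malloc(16);
		if (!vas[i] || !vms[i])
			goto err_free;
	}

	/* Success: this model simply releases everything again. */
	for (i = 0; i < nr; i++) {
		free(vas[i]);
		free(vms[i]);
	}
	free(vas);
	free(vms);
	return 0;

err_free:
	for (i = 0; i < nr; i++) {
		free(vas[i]);	/* free(NULL) is a no-op */
		free(vms[i]);
		(*loop_iters)++;
	}
err_free2:
	free(vas);
	free(vms);
	return -1;
}
```

      When the array allocation itself fails, the cleanup loop runs zero
      times; it only runs when an element allocation fails later on.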
      
      [akpm@linux-foundation.org: remove now-unneeded tests]
      Signed-off-by: Kautuk Consul <consul.kautuk@gmail.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: rearrange putback_inactive_pages · 3f79768f
      Hugh Dickins authored

      There is sometimes confusion between the global putback_lru_pages() in
      migrate.c and the static putback_lru_pages() in vmscan.c: rename the
      latter putback_inactive_pages(): it helps shrink_inactive_list() rather as
      move_active_pages_to_lru() helps shrink_active_list().
      
      Remove unused scan_control arg from putback_inactive_pages() and from
      update_isolated_counts().  Move clear_active_flags() inside
      update_isolated_counts().  Move NR_ISOLATED accounting up into
      shrink_inactive_list() itself, so the balance is clearer.
      
      Do the spin_lock_irq() before calling putback_inactive_pages() and
      spin_unlock_irq() after return from it, so that it better matches
      update_isolated_counts() and move_active_pages_to_lru().

      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove isolate_pages() · f626012d
      Hugh Dickins authored

      The isolate_pages() level in vmscan.c offers little but indirection: merge
      it into isolate_lru_pages() as the compiler does, and use the names
      nr_to_scan and nr_scanned in each case.

      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove del_page_from_lru, add page_off_lru · 1c1c53d4
      Hugh Dickins authored

      del_page_from_lru() repeats del_page_from_lru_list(), also working out
      which LRU the page was on, clearing the relevant bits.  Decouple those
      functions: remove del_page_from_lru() and add page_off_lru().
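
      What a decoupled page_off_lru() does can be sketched as follows.  This
      is a user-space model, assuming hypothetical flag bits and a
      simplified struct page_model; only the idea of "work out the LRU and
      clear the relevant bits" is taken from the description above.

```c
#include <assert.h>

enum lru_list_model {
	LRU_INACTIVE_ANON, LRU_ACTIVE_ANON,
	LRU_INACTIVE_FILE, LRU_ACTIVE_FILE,
	LRU_UNEVICTABLE
};

/* Hypothetical flag bits for this sketch only. */
#define PG_ACTIVE      (1u << 0)
#define PG_UNEVICTABLE (1u << 1)
#define PG_FILE        (1u << 2) /* marks file-backed pages */

struct page_model {
	unsigned int flags;
};

/* Work out which LRU the page was on and clear the matching flag bits. */
enum lru_list_model page_off_lru_model(struct page_model *page)
{
	int lru;

	if (page->flags & PG_UNEVICTABLE) {
		page->flags &= ~PG_UNEVICTABLE;
		return LRU_UNEVICTABLE;
	}

	lru = (page->flags & PG_FILE) ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON;
	if (page->flags & PG_ACTIVE) {
		page->flags &= ~PG_ACTIVE;
		lru += 1; /* each active list follows its inactive sibling */
	}
	return (enum lru_list_model)lru;
}
```

      The caller can then pass the returned list to
      del_page_from_lru_list() itself, which is the decoupling the commit
      describes.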

      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: enum lru_list lru · 4111304d
      Hugh Dickins authored

      Mostly we use "enum lru_list lru": change those few "l"s to "lru"s.

      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: no blank line after EXPORT_SYMBOL in swap.c · 4d06f382
      Hugh Dickins authored

      checkpatch rightly protests
      
        WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
      
      so fix the five offenders in mm/swap.c.

      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fewer underscores in ____pagevec_lru_add · 5095ae83
      Hugh Dickins authored

      What's so special about ____pagevec_lru_add() that it needs four leading
      underscores?  Nothing, it just helped to distinguish from
      __pagevec_lru_add() in 2.6.28 development.  Cut two leading underscores.

      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: take pagevecs off reclaim stack · 2bcf8879
      Hugh Dickins authored

      Replace pagevecs in putback_lru_pages() and move_active_pages_to_lru()
      by lists of pages_to_free: then apply Konstantin Khlebnikov's
      free_hot_cold_page_list() to them instead of pagevec_release().
      
      Which simplifies the flow (no need to drop and retake lock whenever
      pagevec fills up) and reduces stale addresses in stack backtraces
      (which often showed through the pagevecs); but more importantly,
      removes another 120 bytes from the deepest stacks in page reclaim.
      Although I've not recently seen an actual stack overflow here with
      a vanilla kernel, move_active_pages_to_lru() has often featured in
      deep backtraces.
      
      However, free_hot_cold_page_list() does not handle compound pages
      (nor need it: a Transparent HugePage would have been split by the
      time it reaches the call in shrink_page_list()), but it is possible
      for putback_lru_pages() or move_active_pages_to_lru() to be left
      holding the last reference on a THP, so must exclude the unlikely
      compound case before putting on pages_to_free.
      
      Remove pagevec_strip(), its work now done in move_active_pages_to_lru().
      The pagevec in scan_mapping_unevictable_pages() remains in mm/vmscan.c,
      but that is never on the reclaim path, and cannot be replaced by a list.

      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: fix mem_cgroup_print_bad_page · 90b3feae
      Hugh Dickins authored

      If DEBUG_VM, mem_cgroup_print_bad_page() is called whenever bad_page()
      shows a "Bad page state" message, removes page from circulation, adds a
      taint and continues.  This is at a very low level, often when a spinlock
      is held (sometimes when page table lock is held, for example).
      
      We want to recover from this badness, not make it worse: we must not
      kmalloc memory here, we must not do a cgroup path lookup via dubious
      pointers.  No doubt that code was useful to debug a particular case at one
      time, and may be again, but take it out of the mainline kernel.

      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: fix split_huge_page_refcounts() · 12d27107
      Hugh Dickins authored

      This patch started off as a cleanup: __split_huge_page_refcounts() has to
      cope with two scenarios, when the hugepage being split is already on LRU,
      and when it is not; but why does it have to split that accounting across
      three different sites?  Consolidate it in lru_add_page_tail(), handling
      evictable and unevictable alike, and use standard add_page_to_lru_list()
      when accounting is needed (when the head is not yet on LRU).
      
      But a recent regression in -next, I guess the removal of PageCgroupAcctLRU
      test from mem_cgroup_split_huge_fixup(), makes this now a necessary fix:
      under load, the MEM_CGROUP_ZSTAT count was wrapping to a huge number,
      messing up reclaim calculations and causing a freeze at rmdir of cgroup.
      
      Add a VM_BUG_ON to mem_cgroup_lru_del_list() when we're about to wrap that
      count - this has not been the only such incident.  Document that
      lru_add_page_tail() is for Transparent HugePages by #ifdef around it.

      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: check if reclaim should really abort even if compaction_ready() is true for one zone · 0cee34fd
      Mel Gorman authored

      If compaction can proceed for a given zone, shrink_zones() does not
      reclaim any more pages from it.  After commit e0c23279 ("vmscan: abort
      reclaim/compaction if compaction can proceed"), do_try_to_free_pages()
      tries to finish as soon as possible once one zone can compact.
      
      This was intended to prevent slabs being shrunk unnecessarily but there
      are side-effects.  One is that a small zone that is ready for compaction
      will abort reclaim even if the chance of successfully allocating a THP
      from that zone is small.  It also means that reclaim can return too early
      even though sc->nr_to_reclaim pages were not reclaimed.
      
      This partially reverts the commit until it is proven that slabs are really
      being shrunk unnecessarily but preserves the check to return 1 to avoid
      OOM if reclaim was aborted prematurely.
      
      [aarcange@redhat.com: This patch replaces a revert from Andrea]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andy Isaacson <adi@hexapodia.org>
      Cc: Nai Xia <nai.xia@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: when reclaiming for compaction, ensure there are sufficient free pages available · fe4b1b24
      Mel Gorman authored

      In commit e0887c19 ("vmscan: limit direct reclaim for higher order
      allocations"), Rik noted that reclaim was too aggressive when THP was
      enabled.  In his initial patch he used the number of free pages to decide
      if reclaim should abort for compaction.  My feedback was that reclaim and
      compaction should be using the same logic when deciding if reclaim should
      be aborted.
      
      Unfortunately, this had the effect of reducing THP success rates when the
      workload included something like streaming reads that continually
      allocated pages.  The window during which compaction could run and return
      a THP was too small.
      
      This patch combines Rik's two patches together.  compaction_suitable() is
      still used to decide if reclaim should be aborted to allow compaction to
      proceed.  However, it will also ensure that there is a reasonable buffer of
      free pages available.  This improves upon the THP allocation success rates
      but bounds the number of pages that are freed for compaction.

      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andy Isaacson <adi@hexapodia.org>
      Cc: Nai Xia <nai.xia@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: compaction: introduce sync-light migration for use by compaction · a6bc32b8
      Mel Gorman authored

      This patch adds a lightweight sync migrate operation MIGRATE_SYNC_LIGHT
      mode that avoids writing back pages to backing storage.  Async compaction
      maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT.
      For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is
      used.
      
      This avoids sync compaction stalling for an excessive length of time,
      particularly when copying files to a USB stick where there might be a
      large number of dirty pages backed by a filesystem that does not support
      ->writepages.
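
      The mapping between compaction type and migrate mode can be sketched
      as follows.  The three mode names come from the description above; the
      _model suffix, the helper functions and the may_block() rule (that
      only async migration must never block) are assumptions made for this
      illustration.

```c
#include <assert.h>
#include <stdbool.h>

enum migrate_mode_model {
	MIGRATE_ASYNC,
	MIGRATE_SYNC_LIGHT,
	MIGRATE_SYNC
};

/* Compaction never asks for full MIGRATE_SYNC. */
enum migrate_mode_model compaction_migrate_mode(bool sync)
{
	return sync ? MIGRATE_SYNC_LIGHT : MIGRATE_ASYNC;
}

/* Only full sync migration may write dirty pages back to storage. */
bool may_writeback(enum migrate_mode_model mode)
{
	return mode == MIGRATE_SYNC;
}

/* Assumption: both sync flavours may block, async must not. */
bool may_block(enum migrate_mode_model mode)
{
	return mode != MIGRATE_ASYNC;
}
```

      Other migrate_pages() users such as memory hotplug would pass
      MIGRATE_SYNC directly and therefore keep the writeback behaviour.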
      
      [aarcange@redhat.com: This patch is heavily based on Andrea's work]
      [akpm@linux-foundation.org: fix fs/nfs/write.c build]
      [akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andy Isaacson <adi@hexapodia.org>
      Cc: Nai Xia <nai.xia@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: page allocator: do not call direct reclaim for THP allocations while compaction is deferred · 66199712
      Mel Gorman authored

      If compaction is deferred, direct reclaim is used to try to free enough
      pages for the allocation to succeed.  For small high-orders, this has a
      reasonable chance of success.  However, if the caller has specified
      __GFP_NO_KSWAPD to limit the disruption to the system, it makes more sense
      to fail the allocation rather than stall the caller in direct reclaim.
      This patch skips direct reclaim if compaction is deferred and the caller
      specifies __GFP_NO_KSWAPD.
      
      Async compaction only considers a subset of pages so it is possible for
      compaction to be deferred prematurely and not enter direct reclaim even in
      cases where it should.  To compensate for this, this patch also defers
      compaction only if sync compaction failed.
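
      The two decisions described above can be written as small predicates.
      This is a sketch only: GFP_NO_KSWAPD_MODEL uses a hypothetical bit
      value standing in for __GFP_NO_KSWAPD, and both function names are
      invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define GFP_NO_KSWAPD_MODEL (1u << 0) /* hypothetical bit value */

/* Fail the allocation early instead of stalling in direct reclaim. */
bool skip_direct_reclaim(bool compaction_deferred, unsigned int gfp_mask)
{
	return compaction_deferred && (gfp_mask & GFP_NO_KSWAPD_MODEL) != 0;
}

/*
 * Defer compaction only after sync compaction failed; async compaction
 * considers just a subset of pages, so its failure proves little.
 */
bool should_defer_compaction(bool sync_migration, bool compaction_failed)
{
	return sync_migration && compaction_failed;
}
```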

      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andy Isaacson <adi@hexapodia.org>
      Cc: Nai Xia <nai.xia@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: compaction: make isolate_lru_page() filter-aware again · c8244935
      Mel Gorman authored

      Commit 39deaf85 ("mm: compaction: make isolate_lru_page() filter-aware")
      noted that compaction does not migrate dirty or writeback pages and that
      it was meaningless to pick the page and re-add it to the LRU list.  This
      had to be partially reverted because some dirty pages can be migrated by
      compaction without blocking.

      This patch updates "mm: compaction: make isolate_lru_page() filter-aware"
      by skipping over pages that migration has no possibility of migrating, to
      minimise LRU disruption.

      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andy Isaacson <adi@hexapodia.org>
      Cc: Nai Xia <nai.xia@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: compaction: determine if dirty pages can be migrated without blocking within ->migratepage · b969c4ab
      Mel Gorman authored

      Asynchronous compaction is used when allocating transparent hugepages to
      avoid blocking for long periods of time.  Due to reports of stalling,
      there was a debate on disabling synchronous compaction but this severely
      impacted allocation success rates.  Part of the reason was that many dirty
      pages are skipped in asynchronous compaction by the following check:
      
      	if (PageDirty(page) && !sync &&
      		mapping->a_ops->migratepage != migrate_page)
      			rc = -EBUSY;
      
      This skips over all mapping aops using buffer_migrate_page() even though
      it is possible to migrate some of these pages without blocking.  This
      patch updates the ->migratepage callback with a "sync" parameter.  It is
      the responsibility of the callback to fail gracefully if migration would
      block.
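
      A callback that takes the "sync" parameter and fails gracefully might
      look like the following user-space model.  Everything here is
      hypothetical (struct page_model, migratepage_fn,
      buffer_migrate_model(), migrate_one()), and returning -EBUSY for the
      would-block case is a choice made for this sketch that mirrors the
      check quoted above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct page_model {
	bool dirty;
	bool buffers_locked; /* someone else holds the buffer locks */
};

/* Hypothetical callback type carrying the new "sync" parameter. */
typedef int (*migratepage_fn)(struct page_model *page, bool sync);

/*
 * Dirty pages are no longer rejected up front. The callback itself
 * decides whether migrating this page would block.
 */
int buffer_migrate_model(struct page_model *page, bool sync)
{
	if (page->buffers_locked) {
		if (!sync)
			return -EBUSY;	/* would block: fail gracefully */
		page->buffers_locked = false;	/* model: wait for release */
	}
	return 0;
}

/* The caller just forwards its mode to whatever callback is installed. */
int migrate_one(struct page_model *page, migratepage_fn fn, bool sync)
{
	return fn(page, sync);
}
```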

      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andy Isaacson <adi@hexapodia.org>
      Cc: Nai Xia <nai.xia@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: do not OOM if aborting reclaim to start compaction · 7335084d
      Mel Gorman authored

      During direct reclaim it is possible that reclaim will be aborted so that
      compaction can be attempted to satisfy a high-order allocation.  If this
      decision is made before any pages are reclaimed, it is possible that 0 is
      returned to the page allocator potentially triggering an OOM.  This has
      not been observed but it is a possibility so this patch addresses it.
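
      The return-value rule can be sketched as a single function.  This is a
      simplified model: reclaim_result() and its parameters are hypothetical
      names, and the real reclaim path has more inputs than shown here.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Value handed back to the page allocator. Returning 0 would signal
 * "no progress" and could lead the allocator towards OOM handling.
 */
unsigned long reclaim_result(unsigned long nr_reclaimed,
			     bool aborted_for_compaction)
{
	if (nr_reclaimed)
		return nr_reclaimed;

	/* Aborted before reclaiming anything: still report progress. */
	if (aborted_for_compaction)
		return 1;

	return 0;
}
```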

      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andy Isaacson <adi@hexapodia.org>
      Cc: Nai Xia <nai.xia@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: check if we isolated a compound page during lumpy scan · 50134731
      Andrea Arcangeli authored

      Properly take into account if we isolated a compound page during the lumpy
      scan in reclaim and skip over the tail pages when encountered.  This
      corrects the values given to the tracepoint for number of lumpy pages
      isolated and will avoid breaking the loop early if compound pages smaller
      than the requested allocation size are requested.
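
      Skipping tail pages while counting isolated pages can be sketched as
      follows.  This is a user-space model under stated assumptions: the
      order[] array is a hypothetical stand-in for page state, where a
      nonzero entry marks the head of a compound page of 2^order pages and
      tail or base pages hold 0, and every compound page is assumed to lie
      entirely inside the scanned range.

```c
#include <assert.h>

/*
 * Returns the number of pages isolated in [0, nr_pfns) and stores the
 * number of loop iterations in *iterations.
 */
unsigned long count_isolated(const unsigned int *order,
			     unsigned long nr_pfns,
			     unsigned long *iterations)
{
	unsigned long pfn, isolated = 0;

	*iterations = 0;
	for (pfn = 0; pfn < nr_pfns; pfn++) {
		unsigned long pages = 1UL << order[pfn];

		isolated += pages;	/* count the whole compound page */
		pfn += pages - 1;	/* skip over its tail pages */
		(*iterations)++;
	}
	return isolated;
}
```

      With one order-2 compound page among six page frames, the loop visits
      three entries but reports six isolated pages, which is the kind of
      corrected count the tracepoint should receive.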
      
      [mgorman@suse.de: Updated changelog]
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andy Isaacson <adi@hexapodia.org>
      Cc: Nai Xia <nai.xia@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Mel Gorman's avatar
      mm: compaction: use synchronous compaction for /proc/sys/vm/compact_memory · b16d3d5a
      Mel Gorman authored
      
      
      When asynchronous compaction was introduced, the
      /proc/sys/vm/compact_memory handler should have been updated to always use
      synchronous compaction.  This did not happen so this patch addresses it.
      
      The assumption is if a user writes to /proc/sys/vm/compact_memory, they
      are willing for that process to stall.
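
For illustration only, triggering compaction by hand amounts to a single write to that file. The Python sketch below is not part of the patch: the helper name `trigger_compaction` is hypothetical, and it assumes a kernel built with compaction support and a caller with root privileges. With this patch applied, the write would be expected to block while synchronous compaction runs.

```python
def trigger_compaction(path="/proc/sys/vm/compact_memory"):
    """Request memory compaction by writing to the sysctl file.

    The path is a parameter so the helper can be exercised against an
    ordinary file; on a real system the write needs root privileges.
    """
    with open(path, "w") as f:
        f.write("1\n")
```

Calling `trigger_compaction()` with no argument targets the real sysctl file; passing another path is useful only for dry runs.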
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andy Isaacson <adi@hexapodia.org>
      Cc: Nai Xia <nai.xia@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Mel Gorman's avatar
      mm: compaction: allow compaction to isolate dirty pages · a77ebd33
      Mel Gorman authored
      Short summary: There are severe stalls when a USB stick using VFAT is
      used with THP enabled that are reduced by this series.  If you are
      experiencing this problem, please test and report back.  Considering I
      have seen complaints from openSUSE and Fedora users on this as well as a
      few private mails, I'm guessing it's a widespread issue.  This is a new
      type of USB-related stall because it is due to synchronous compaction
      writing, whereas in the past the big problem was dirty pages reaching
      the end of the LRU and being written by reclaim.
      
      Am cc'ing Andrew this time and this series would replace
      mm-do-not-stall-in-synchronous-compaction-for-thp-allocations.patch.
      I'm also cc'ing Dave Jones as he might have merged that patch to Fedora
      for wider testing and ideally it would be reverted and replaced by this
      series.
      
      That said, the later patches could really do with some review.  If this
      series is not the answer then a new direction needs to be discussed
      because as it is, the stalls are unacceptable as the results in this
      leader show.
      
      For testers that try backporting this to 3.1, it won't work because
      there is a non-obvious dependency on not writing back pages in direct
      reclaim so you need those patches too.
      
      Changelog since V5
      o Rebase to 3.2-rc5
      o Tidy up the changelogs a bit
      
      Changelog since V4
      o Added reviewed-bys, credited Andrea properly for sync-light
      o Allow dirty pages without mappings to be considered for migration
      o Bound the number of pages freed for compaction
      o Isolate PageReclaim pages on their own LRU list
      
      This is against 3.2-rc5 and follows on from discussions on "mm: Do
      not stall in synchronous compaction for THP allocations" and "[RFC
      PATCH 0/5] Reduce compaction-related stalls". Initially, the proposed
      patch eliminated stalls due to compaction which sometimes resulted in
      user-visible interactivity problems on browsers by simply never using
      sync compaction. The downside was that THP allocation success rates
      were lower because dirty pages were not being migrated, as reported by
      Andrea. His approach to fixing this was nacked on the grounds that
      it reverted fixes merged from Rik that reduced the number of pages
      reclaimed, and reverting them severely impacted the performance of
      his workloads.
      
      This series attempts to reconcile the requirements of maximising THP
      usage, without stalling in a user-visible fashion due to compaction
      or cheating by reclaiming an excessive number of pages.
      
      Patch 1 partially reverts commit 39deaf85 to allow migration to isolate
      	dirty pages. This is because migration can move some dirty
      	pages without blocking.
      
      Patch 2 notes that the /proc/sys/vm/compact_memory handler is not using
      	synchronous compaction when it should be. This is unrelated
      	to the reported stalls but is worth fixing.
      
      Patch 3 checks if we isolated a compound page during lumpy scan and
      	accounts for it properly. For the most part, this affects
      	tracing so it's unrelated to the stalls but worth fixing.
      
      Patch 4 notes that it is possible to abort reclaim early for compaction
      	and return 0 to the page allocator potentially entering the
      	"may oom" path. This has not been observed in practice but
      	the rest of the series potentially makes it easier to happen.
      
      Patch 5 adds a sync parameter to the migratepage callback and gives
      	the callback responsibility for migrating the page without
      	blocking if sync==false. For example, fallback_migrate_page
      	will not call writepage if sync==false. This increases the
      	number of pages that can be handled by asynchronous compaction
      	thereby reducing stalls.
      
      Patch 6 restores filter-awareness to isolate_lru_page for migration.
      	In practice, it means that pages under writeback and pages
      	without a ->migratepage callback will not be isolated
      	for migration.
      
      Patch 7 avoids calling direct reclaim if compaction is deferred but
      	makes sure that compaction is only deferred if sync
      	compaction was used.
      
      Patch 8 introduces a sync-light migration mechanism that sync compaction
      	uses. The objective is to allow some stalls but to not call
      	->writepage which can lead to significant user-visible stalls.
      
      Patch 9 notes that while we want to abort reclaim ASAP to allow
      	compaction to go ahead, we leave a very small window of
      	opportunity for compaction to run. This patch allows more pages
      	to be freed by reclaim but bounds the number to a reasonable
      	level based on the high watermark on each zone.
      
      Patch 10 allows slabs to be shrunk even after compaction_ready() is
      	true for one zone. This is to avoid a problem whereby a single
      	small zone can abort reclaim even though no pages have been
      	reclaimed and no suitably large zone is in a usable state.
      
      Patch 11 fixes a problem with the rate of page scanning. As reclaim is
      	rarely stalling on pages under writeback it means that scan
      	rates are very high. This is particularly true for direct
      	reclaim which is not calling writepage. The vmstat figures
      	implied that much of this was busy work with PageReclaim pages
      	marked for immediate reclaim. This patch is a prototype that
      	moves these pages to their own LRU list.
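
The contract described for Patch 5 can be pictured with a toy model. The Python sketch below is illustrative only and is not the kernel implementation: the `Page` class and the string return values are hypothetical stand-ins, and the only behaviour taken from the description above is that the fallback path must not call writepage when sync is false.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for a page; only the dirty state is modelled."""
    dirty: bool = False


def fallback_migrate_page(page, sync):
    """Toy model of a migration fallback that honours a sync flag."""
    if page.dirty:
        if not sync:
            # Asynchronous caller: refuse instead of blocking on writeback.
            return "busy"
        # Synchronous caller: allowed to write the page out and stall.
        return "writeout"
    # Clean page: can be moved without blocking in either mode.
    return "migrated"
```

In this model an asynchronous caller gets an immediate refusal for a dirty page rather than a stall, which is the property the series relies on to keep asynchronous compaction non-blocking.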
      
      This has been tested and other than 2 USB keys getting trashed,
      nothing horrible fell out. That said, I am a bit unhappy with the
      rescue logic in patch 11 but did not find a better way around it. It
      does significantly reduce scan rates and System CPU time indicating
      it is the right direction to take.
      
      What is of critical importance is that stalls due to compaction
      are massively reduced even though sync compaction was still
      allowed. Testing from people complaining about stalls copying to USBs
      with THP enabled is particularly welcome.
      
      The following tests all involve THP usage and USB keys in some
      way. Each test follows this type of pattern
      
      1. Read from some fast storage, be it raw device or file. Each time
         the copy finishes, start again until the test ends
      2. Write a large file to a filesystem on a USB stick. Each time the copy
         finishes, start again until the test ends
      3. When memory is low, start an alloc process that creates a mapping
         the size of physical memory to stress THP allocation. This is the
         "real" part of the test and the part that is meant to trigger
         stalls when THP is enabled. Copying continues in the background.
      4. Record the CPU usage and time to execute of the alloc process
      5. Record the number of THP allocs and fallbacks as well as the number of THP
         pages in use at the end of the test just before alloc exited
      6. Run the test 5 times to get an idea of variability
      7. Between each run, sync is run and caches dropped and the test
         waits until nr_dirty is a small number to avoid interference
         or caching between iterations that would skew the figures.
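
Step 7 can be sketched as follows. This is a minimal Python illustration, not the harness that produced these results: the function names, the threshold of 100 pages and the polling interval are assumptions, since the description above only says that nr_dirty should be "a small number". It relies on /proc/vmstat listing one "name value" pair per line and on /proc/sys/vm/drop_caches accepting "3" to drop the page cache and slab objects, and it needs root privileges to run for real.

```python
import os
import time


def read_nr_dirty(vmstat_text):
    """Return the nr_dirty counter from the text of /proc/vmstat."""
    for line in vmstat_text.splitlines():
        name, _, value = line.partition(" ")
        if name == "nr_dirty":
            return int(value)
    raise KeyError("nr_dirty not found")


def settle(threshold=100, poll_seconds=1.0,
           vmstat_path="/proc/vmstat",
           drop_caches_path="/proc/sys/vm/drop_caches"):
    """Flush dirty data, drop caches, then wait for nr_dirty to fall."""
    os.sync()                      # write back dirty pages
    with open(drop_caches_path, "w") as f:
        f.write("3\n")             # drop page cache and slab objects
    while True:
        with open(vmstat_path) as f:
            if read_nr_dirty(f.read()) <= threshold:  # assumed threshold
                return
        time.sleep(poll_seconds)
```

Calling `settle()` between iterations approximates the quiescing step; only `read_nr_dirty` can be exercised without touching the system.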
      
      The individual tests were then
      
      writebackCPDeviceBasevfat
      	Disable THP, read from a raw device (sda), vfat on USB stick
      writebackCPDeviceBaseext4
      	Disable THP, read from a raw device (sda), ext4 on USB stick
      writebackCPDevicevfat
      	THP enabled, read from a raw device (sda), vfat on USB stick
      writebackCPDeviceext4
      	THP enabled, read from a raw device (sda), ext4 on USB stick
      writebackCPFilevfat
      	THP enabled, read from a file on fast storage and USB, both vfat
      writebackCPFileext4
      	THP enabled, read from a file on fast storage and USB, both ext4
      
      The kernels tested were
      
      3.1		3.1
      vanilla		3.2-rc5
      freemore	Patches 1-10
      immediate	Patches 1-11
      andrea		The 8 patches Andrea posted as a basis of comparison
      
      The results are very long unfortunately. I'll start with the case
      where we are not using THP at all
      
      writebackCPDeviceBasevfat
                         3.1.0-vanilla         rc5-vanilla       freemore-v6r1        isolate-v6r1         andrea-v2r1
      System Time         1.28 (    0.00%)   54.49 (-4143.46%)   48.63 (-3687.69%)    4.69 ( -265.11%)   51.88 (-3940.81%)
      +/-                 0.06 (    0.00%)    2.45 (-4305.55%)    4.75 (-8430.57%)    7.46 (-13282.76%)    4.76 (-8440.70%)
      User Time           0.09 (    0.00%)    0.05 (   40.91%)    0.06 (   29.55%)    0.07 (   15.91%)    0.06 (   27.27%)
      +/-                 0.02 (    0.00%)    0.01 (   45.39%)    0.02 (   25.07%)    0.00 (   77.06%)    0.01 (   52.24%)
      Elapsed Time      110.27 (    0.00%)   56.38 (   48.87%)   49.95 (   54.70%)   11.77 (   89.33%)   53.43 (   51.54%)
      +/-                 7.33 (    0.00%)    3.77 (   48.61%)    4.94 (   32.63%)    6.71 (    8.50%)    4.76 (   35.03%)
      THP Active          0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)
      +/-                 0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)
      Fault Alloc         0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)
      +/-                 0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)
      Fault Fallback      0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)
      +/-                 0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)
      
      The THP figures are obviously all 0 because THP was disabled. The
      main thing to watch is the elapsed times and how they compare to
      times when THP is enabled later. It's also important to note that
      elapsed time is improved by this series as System CPU time is much
      reduced.
      
      writebackCPDevicevfat
      
                         3.1.0-vanilla         rc5-vanilla       freemore-v6r1        isolate-v6r1         andrea-v2r1
      System Time         1.22 (    0.00%)   13.89 (-1040.72%)   46.40 (-3709.20%)    4.44 ( -264.37%)   47.37 (-3789.33%)
      +/-                 0.06 (    0.00%)   22.82 (-37635.56%)    3.84 (-6249.44%)    6.48 (-10618.92%)    6.60 (-10818.53%)
      User Time           0.06 (    0.00%)    0.06 (   -6.90%)    0.05 (   17.24%)    0.05 (   13.79%)    0.04 (   31.03%)
      +/-                 0.01 (    0.00%)    0.01 (   33.33%)    0.01 (   33.33%)    0.01 (   39.14%)    0.01 (   25.46%)
      Elapsed Time     10445.54 (    0.00%) 2249.92 (   78.46%)   70.06 (   99.33%)   16.59 (   99.84%)  472.43 (   95.48%)
      +/-               643.98 (    0.00%)  811.62 (  -26.03%)   10.02 (   98.44%)    7.03 (   98.91%)   59.99 (   90.68%)
      THP Active         15.60 (    0.00%)   35.20 (  225.64%)   65.00 (  416.67%)   70.80 (  453.85%)   62.20 (  398.72%)
      +/-                18.48 (    0.00%)   51.29 (  277.59%)   15.99 (   86.52%)   37.91 (  205.18%)   22.02 (  119.18%)
      Fault Alloc       121.80 (    0.00%)   76.60 (   62.89%)  155.40 (  127.59%)  181.20 (  148.77%)  286.60 (  235.30%)
      +/-                73.51 (    0.00%)   61.11 (   83.12%)   34.89 (   47.46%)   31.88 (   43.36%)   68.13 (   92.68%)
      Fault Fallback    881.20 (    0.00%)  926.60 (   -5.15%)  847.60 (    3.81%)  822.00 (    6.72%)  716.60 (   18.68%)
      +/-                73.51 (    0.00%)   61.26 (   16.67%)   34.89 (   52.54%)   31.65 (   56.94%)   67.75 (    7.84%)
      MMTests Statistics: duration
      User/Sys Time Running Test (seconds)       3540.88   1945.37    716.04     64.97   1937.03
      Total Elapsed Time (seconds)              52417.33  11425.90    501.02    230.95   2520.28
      
      The first thing to note is the "Elapsed Time" for the vanilla kernels
      of 2249 seconds versus 56 with THP disabled which might explain the
      reports of USB stalls with THP enabled. Applying the patches brings
      performance in line with THP-disabled performance while isolating
      pages for immediate reclaim from the LRU cuts down System CPU time.
      
      The "Fault Alloc" success rate figures are also improved. The vanilla
      kernel only managed to allocate 76.6 pages on average over the course
      of 5 iterations, whereas applying the series allocated 181.20 on
      average, albeit well within variance. It's worth noting that
      applying the series at least decreases the amount of variance, which
      implies an improvement.
      
      Andrea's series had a higher success rate for THP allocations but
      at a severe cost to elapsed time which is still better than vanilla
      but still much worse than disabling THP altogether. One can bring my
      series close to Andrea's by removing this check
      
              /*
               * If compaction is deferred for high-order allocations, it is because
               * sync compaction recently failed. If this is the case and the caller
               * has requested the system not be heavily disrupted, fail the
               * allocation now instead of entering direct reclaim
               */
              if (deferred_compaction && (gfp_mask & __GFP_NO_KSWAPD))
                      goto nopage;
      
      I didn't include a patch that removed the above check because hurting
      overall performance to improve the THP figure is not what the average
      user wants. It's something to consider though if someone really wants
      to maximise THP usage no matter what it does to the workload initially.
      
      This is a summary of vmstat figures from the same test.
      
                                             3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
      Page Ins                                  3257266139  1111844061    17263623    10901575   161423219
      Page Outs                                   81054922    30364312     3626530     3657687     8753730
      Swap Ins                                        3294        2851        6560        4964        4592
      Swap Outs                                     390073      528094      620197      790912      698285
      Direct pages scanned                      1077581700  3024951463  1764930052   115140570  5901188831
      Kswapd pages scanned                        34826043     7112868     2131265     1686942     1893966
      Kswapd pages reclaimed                      28950067     4911036     1246044      966475     1497726
      Direct pages reclaimed                     805148398   280167837     3623473     2215044    40809360
      Kswapd efficiency                                83%         69%         58%         57%         79%
      Kswapd velocity                              664.399     622.521    4253.852    7304.360     751.490
      Direct efficiency                                74%          9%          0%          1%          0%
      Direct velocity                            20557.737  264745.137 3522673.849  498551.938 2341481.435
      Percentage direct scans                          96%         99%         99%         98%         99%
      Page writes by reclaim                        722646      529174      620319      791018      699198
      Page writes file                              332573        1080         122         106         913
      Page writes anon                              390073      528094      620197      790912      698285
      Page reclaim immediate                             0  2552514720  1635858848   111281140  5478375032
      Page rescued immediate                             0           0           0       87848           0
      Slabs scanned                                  23552       23552        9216        8192        9216
      Direct inode steals                              231           0           0           0           0
      Kswapd inode steals                                0           0           0           0           0
      Kswapd skipped wait                            28076         786           0          61           6
      THP fault alloc                                  609         383         753         906        1433
      THP collapse alloc                                12           6           0           0           6
      THP splits                                       536         211         456         593        1136
      THP fault fallback                              4406        4633        4263        4110        3583
      THP collapse fail                                120         127           0           0           4
      Compaction stalls                               1810         728         623         779        3200
      Compaction success                               196          53          60          80         123
      Compaction failures                             1614         675         563         699        3077
      Compaction pages moved                        193158       53545      243185      333457      226688
      Compaction move failure                         9952        9396       16424       23676       45070
      
      The main things to look at are
      
      1. Page In/out figures are much reduced by the series.
      
      2. Direct page scanning is incredibly high (264745.137 pages scanned
         per second on the vanilla kernel) but isolating PageReclaim pages
         on their own list reduces the number of pages scanned significantly.
      
      3. The fact that "Page rescued immediate" is a positive number implies
         that we sometimes race removing pages from the LRU_IMMEDIATE list
         that need to be put back on a normal LRU but it happens only for
         0.07% of the pages marked for immediate reclaim.
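
The derived rows in the vmstat summary can be reproduced from the raw counters. The Python sketch below is a worked check rather than the tooling that generated the table: the function names are hypothetical, and dividing by "Total Elapsed Time" is an assumption that happens to match the printed velocity. The inputs are the 3.1.0-vanilla direct reclaim counters and the isolate-v6r1 immediate-reclaim counters from the tables above.

```python
def efficiency_pct(reclaimed, scanned):
    """Pages reclaimed as a percentage of pages scanned."""
    return 100.0 * reclaimed / scanned


def velocity(scanned, elapsed_seconds):
    """Pages scanned per second of total elapsed test time."""
    return scanned / elapsed_seconds


# 3.1.0-vanilla: direct reclaim counters and total elapsed time.
direct_eff = efficiency_pct(805148398, 1077581700)
direct_vel = velocity(1077581700, 52417.33)

# isolate-v6r1: rescued pages as a share of pages marked for
# immediate reclaim.
rescued_pct = 100.0 * 87848 / 111281140

print(f"direct efficiency: {direct_eff:.2f}%")
print(f"direct velocity:   {direct_vel:.3f} pages/sec")
print(f"rescued immediate: {rescued_pct:.4f}%")
```

The efficiency value is a little under 75%, so the table's 74% appears to be truncated rather than rounded; the rescued share quoted in point 3 appears to be truncated in the same way.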
      
      writebackCPDeviceext4
                         3.1.0-vanilla         rc5-vanilla       freemore-v6r1        isolate-v6r1         andrea-v2r1
      System Time         1.51 (    0.00%)    1.77 (  -17.66%)    1.46 (    2.92%)    1.15 (   23.77%)    1.89 (  -25.63%)
      +/-                 0.27 (    0.00%)    0.67 ( -148.52%)    0.33 (  -22.76%)    0.30 (  -11.15%)    0.19 (   30.16%)
      User Time           0.03 (    0.00%)    0.04 (  -37.50%)    0.05 (  -62.50%)    0.07 ( -112.50%)    0.04 (  -18.75%)
      +/-                 0.01 (    0.00%)    0.02 ( -146.64%)    0.02 (  -97.91%)    0.02 (  -75.59%)    0.02 (  -63.30%)
      Elapsed Time      124.93 (    0.00%)  114.49 (    8.36%)   96.77 (   22.55%)   27.48 (   78.00%)  205.70 (  -64.65%)
      +/-                20.20 (    0.00%)   74.39 ( -268.34%)   59.88 ( -196.48%)    7.72 (   61.79%)   25.03 (  -23.95%)
      THP Active        161.80 (    0.00%)   83.60 (   51.67%)  141.20 (   87.27%)   84.60 (   52.29%)   82.60 (   51.05%)
      +/-                71.95 (    0.00%)   43.80 (   60.88%)   26.91 (   37.40%)   59.02 (   82.03%)   52.13 (   72.45%)
      Fault Alloc       471.40 (    0.00%)  228.60 (   48.49%)  282.20 (   59.86%)  225.20 (   47.77%)  388.40 (   82.39%)
      +/-                88.07 (    0.00%)   87.42 (   99.26%)   73.79 (   83.78%)  109.62 (  124.47%)   82.62 (   93.81%)
      Fault Fallback    531.60 (    0.00%)  774.60 (  -45.71%)  720.80 (  -35.59%)  777.80 (  -46.31%)  614.80 (  -15.65%)
      +/-                88.07 (    0.00%)   87.26 (    0.92%)   73.79 (   16.22%)  109.62 (  -24.47%)   82.29 (    6.56%)
      MMTests Statistics: duration
      User/Sys Time Running Test (seconds)         50.22     33.76     30.65     24.14    128.45
      Total Elapsed Time (seconds)               1113.73   1132.19   1029.45    759.49   1707.26
      
      Similar test but the USB stick is using ext4 instead of vfat. As
      ext4 does not use writepage for migration, the large stalls due to
      compaction when THP is enabled are not observed. Still, isolating
      PageReclaim pages on their own list helped completion time largely
      by reducing the number of pages scanned by direct reclaim although
      time spent in congestion_wait could also be a factor.
      
      Again, Andrea's series had far higher success rates for THP allocation
      at the cost of elapsed time. I didn't look too closely but a quick
      look at the vmstat figures tells me kswapd reclaimed 8 times more pages
      than the patch series and direct reclaim reclaimed roughly three times
      as many pages. It follows that if memory is aggressively reclaimed,
      there will be more available for THP.
      
      writebackCPFilevfat
                         3.1.0-vanilla         rc5-vanilla       freemore-v6r1        isolate-v6r1         andrea-v2r1
      System Time         1.76 (    0.00%)   29.10 (-1555.52%)   46.01 (-2517.18%)    4.79 ( -172.35%)   54.89 (-3022.53%)
      +/-                 0.14 (    0.00%)   25.61 (-18185.17%)    2.15 (-1434.83%)    6.60 (-4610.03%)    9.75 (-6863.76%)
      User Time           0.05 (    0.00%)    0.07 (  -45.83%)    0.05 (   -4.17%)    0.06 (  -29.17%)    0.06 (  -16.67%)
      +/-                 0.02 (    0.00%)    0.02 (   20.11%)    0.02 (   -3.14%)    0.01 (   31.58%)    0.01 (   47.41%)
      Elapsed Time     22520.79 (    0.00%) 1082.85 (   95.19%)   73.30 (   99.67%)   32.43 (   99.86%)  291.84 (  98.70%)
      +/-              7277.23 (    0.00%)  706.29 (   90.29%)   19.05 (   99.74%)   17.05 (   99.77%)  125.55 (   98.27%)
      THP Active         83.80 (    0.00%)   12.80 (   15.27%)   15.60 (   18.62%)   13.00 (   15.51%)    0.80 (    0.95%)
      +/-                66.81 (    0.00%)   20.19 (   30.22%)    5.92 (    8.86%)   15.06 (   22.54%)    1.17 (    1.75%)
      Fault Alloc       171.00 (    0.00%)   67.80 (   39.65%)   97.40 (   56.96%)  125.60 (   73.45%)  133.00 (   77.78%)
      +/-                82.91 (    0.00%)   30.69 (   37.02%)   53.91 (   65.02%)   55.05 (   66.40%)   21.19 (   25.56%)
      Fault Fallback    832.00 (    0.00%)  935.20 (  -12.40%)  906.00 (   -8.89%)  877.40 (   -5.46%)  870.20 (   -4.59%)
      +/-                82.91 (    0.00%)   30.69 (   62.98%)   54.01 (   34.86%)   55.05 (   33.60%)   20.91 (   74.78%)
      MMTests Statistics: duration
      User/Sys Time Running Test (seconds)       7229.81    928.42    704.52     80.68   1330.76
      Total Elapsed Time (seconds)             112849.04   5618.69    571.11    360.54   1664.28
      
      In this case, the test is reading/writing only from filesystems but as
      it's vfat, it's slow due to calling writepage during compaction. Little
      to observe really - the time to complete the test goes way down
      with the series applied and THP allocation success rates go up in
      comparison to 3.2-rc5.  The success rates are lower than 3.1.0 but
      the elapsed time for that kernel is abysmal so it is not really a
      sensible comparison.
      
      As before, Andrea's series allocates more THPs at the cost of overall
      performance.
      
      writebackCPFileext4
                         3.1.0-vanilla         rc5-vanilla       freemore-v6r1        isolate-v6r1         andrea-v2r1
      System Time         1.51 (    0.00%)    1.77 (  -17.66%)    1.46 (    2.92%)    1.15 (   23.77%)    1.89 (  -25.63%)
      +/-                 0.27 (    0.00%)    0.67 ( -148.52%)    0.33 (  -22.76%)    0.30 (  -11.15%)    0.19 (   30.16%)
      User Time           0.03 (    0.00%)    0.04 (  -37.50%)    0.05 (  -62.50%)    0.07 ( -112.50%)    0.04 (  -18.75%)
      +/-                 0.01 (    0.00%)    0.02 ( -146.64%)    0.02 (  -97.91%)    0.02 (  -75.59%)    0.02 (  -63.30%)
      Elapsed Time      124.93 (    0.00%)  114.49 (    8.36%)   96.77 (   22.55%)   27.48 (   78.00%)  205.70 (  -64.65%)
      +/-                20.20 (    0.00%)   74.39 ( -268.34%)   59.88 ( -196.48%)    7.72 (   61.79%)   25.03 (  -23.95%)
      THP Active        161.80 (    0.00%)   83.60 (   51.67%)  141.20 (   87.27%)   84.60 (   52.29%)   82.60 (   51.05%)
      +/-                71.95 (    0.00%)   43.80 (   60.88%)   26.91 (   37.40%)   59.02 (   82.03%)   52.13 (   72.45%)
      Fault Alloc       471.40 (    0.00%)  228.60 (   48.49%)  282.20 (   59.86%)  225.20 (   47.77%)  388.40 (   82.39%)
      +/-                88.07 (    0.00%)   87.42 (   99.26%)   73.79 (   83.78%)  109.62 (  124.47%)   82.62 (   93.81%)
      Fault Fallback    531.60 (    0.00%)  774.60 (  -45.71%)  720.80 (  -35.59%)  777.80 (  -46.31%)  614.80 (  -15.65%)
      +/-                88.07 (    0.00%)   87.26 (    0.92%)   73.79 (   16.22%)  109.62 (  -24.47%)   82.29 (    6.56%)
      MMTests Statistics: duration
      User/Sys Time Running Test (seconds)         50.22     33.76     30.65     24.14    128.45
      Total Elapsed Time (seconds)               1113.73   1132.19   1029.45    759.49   1707.26
      
      Same type of story - elapsed times go down. In this case, allocation
      success rates are roughly the same. As before, Andrea's series has higher
      success rates but takes a lot longer.
      
      Overall the series does reduce latencies and while the tests are
      inherently racy as alloc competes with the cp processes, the variability
      was included. The THP allocation rates are not as high as they could
      be but that is because we would have to be more aggressive about
      reclaim and compaction impacting overall performance.
      
      This patch:
      
      Commit 39deaf85 ("mm: compaction: make isolate_lru_page() filter-aware")
      noted that compaction does not migrate dirty or writeback pages and that
      it was meaningless to pick the page and re-add it to the LRU list.
      
      What was missed during review is that asynchronous migration moves dirty
      pages if their ->migratepage callback is migrate_page() because these can
      be moved without blocking.  This potentially impacted hugepage allocation
      success rates by a factor depending on how many dirty pages are in the
      system.
      
      This patch partially reverts 39deaf85 to allow migration to isolate dirty
      pages again.  This increases how much compaction disrupts the LRU but that
      is addressed later in the series.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andy Isaacson <adi@hexapodia.org>
      Cc: Nai Xia <nai.xia@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Tao Ma's avatar
      vmscan/trace: Add 'file' info to trace_mm_vmscan_lru_isolate() · ea4d349f
      Tao Ma authored
      
      
      In trace_mm_vmscan_lru_isolate(), we don't output 'file' information to
      the trace event and it is a bit inconvenient for the user to get the
      real information (like pasted below).
      
      mm_vmscan_lru_isolate: isolate_mode=2 order=0 nr_requested=32 nr_scanned=32 nr_taken=32 contig_taken=0 contig_dirty=0 contig_failed=0
      
      'active' can be obtained by analyzing mode (thanks go to Minchan and
      Mel), so this patch adds 'file' to the trace event and it now looks
      like:
      
      mm_vmscan_lru_isolate: isolate_mode=2 order=0 nr_requested=32 nr_scanned=32 nr_taken=32 contig_taken=0 contig_dirty=0 contig_failed=0 file=0
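
A consumer of the extended event could pick out the new field as sketched below. This Python example is an illustration only: `parse_trace_event` is a hypothetical helper, and it assumes the input has already been reduced to the "event: key=value ..." portion shown above, without the task, CPU and timestamp prefix that a real trace buffer line carries.

```python
def parse_trace_event(line):
    """Split 'event: k=v k=v ...' into the event name and integer fields."""
    name, _, rest = line.partition(":")
    fields = {}
    for token in rest.split():
        key, _, value = token.partition("=")
        fields[key] = int(value)
    return name.strip(), fields


event = ("mm_vmscan_lru_isolate: isolate_mode=2 order=0 nr_requested=32 "
         "nr_scanned=32 nr_taken=32 contig_taken=0 contig_dirty=0 "
         "contig_failed=0 file=0")
name, fields = parse_trace_event(event)
# After the patch, fields["file"] says whether a file or anon list was scanned.
print(name, fields["file"])
```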
      Signed-off-by: Tao Ma <boyu.mt@taobao.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Shaohua Li's avatar
      thp: improve order in lru list for split huge page · 45676885
      Shaohua Li authored
      
      
      Put the tail subpages of an isolated hugepage under splitting at the lru
      reclaim head, as they should presumably be isolated next as well.
      
      Queue the subpages in physical order in the lru for non-isolated
      hugepages under splitting.  That might provide some theoretical cache
      benefit to the buddy allocator later.
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Shaohua Li's avatar
      thp: add tlb_remove_pmd_tlb_entry · f21760b1
      Shaohua Li authored
      
      
      We have tlb_remove_tlb_entry to indicate that a pte tlb entry should be
      flushed, but no corresponding API for a pmd entry.  This hasn't been a
      problem so far because THP is currently only for x86 and tlb_flush()
      under x86 flushes the entire TLB.  But this is confusing and could be
      missed if thp is ported to another arch.
      
      Also convert tlb->need_flush = 1 to a VM_BUG_ON(!tlb->need_flush) in
      __tlb_remove_page() as suggested by Andrea Arcangeli.  The
      __tlb_remove_page() function is supposed to be called after
      tlb_remove_xxx_tlb_entry() and we can catch any misuse.
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f21760b1
    • Shaohua Li's avatar
      thp: remove unnecessary tlb flush for mprotect · e5591307
      Shaohua Li authored
      
      
      change_protection() will do the TLB flush later, so a duplicate tlb
      flush is not needed.
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e5591307
    • Shaohua Li's avatar
      thp: improve the error code path · 569e5590
      Shaohua Li authored
      
      
      Improve the error code path: for example, delete the unnecessary sysfs
      file.  Also remove the #ifdef xxx to make the code cleaner.
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      569e5590
    • Bob Liu's avatar
      page_cgroup: drop multi CONFIG_MEMORY_HOTPLUG · 0efc8eb9
      Bob Liu authored
      
      
      No need for two CONFIG_MEMORY_HOTPLUG blocks.
      Signed-off-by: Bob Liu <lliubbo@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0efc8eb9
    • Bob Liu's avatar
      page_alloc: break early in check_for_regular_memory() · d0048b0e
      Bob Liu authored
      
      
      If a zone below ZONE_NORMAL has present_pages, we can set the node
      state to N_NORMAL_MEMORY; there is no need to loop to the end.
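      
      A minimal userspace sketch of the early break follows.  The types are
      simplified, hypothetical stand-ins for the kernel's zone and node
      structures, the has_normal_memory flag models the N_NORMAL_MEMORY
      node state, and zones_checked exists only so the effect of the break
      can be observed.  Scanning up to and including ZONE_NORMAL is an
      assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel's zone and node types. */
enum zone_type { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM, MAX_NR_ZONES };

struct zone {
	unsigned long present_pages;
};

struct pglist_data {
	struct zone node_zones[MAX_NR_ZONES];
	bool has_normal_memory;	/* models the N_NORMAL_MEMORY node state */
	int zones_checked;	/* instrumentation for this sketch only */
};

/*
 * Mark the node as having regular memory as soon as one zone up to
 * ZONE_NORMAL reports present pages, then stop scanning.
 */
static void check_for_regular_memory(struct pglist_data *pgdat)
{
	int zt;

	assert(pgdat != NULL);
	pgdat->zones_checked = 0;
	for (zt = 0; zt <= ZONE_NORMAL; zt++) {
		pgdat->zones_checked++;
		if (pgdat->node_zones[zt].present_pages) {
			pgdat->has_normal_memory = true;
			break;	/* the node state is already decided */
		}
	}
}
```

      With pages present in the first zone, the loop in this sketch visits
      one zone instead of two.
      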
      Signed-off-by: Bob Liu <lliubbo@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d0048b0e
    • Bob Liu's avatar
      memcg: cleanup for_each_node_state() · 3ed28fa1
      Bob Liu authored
      
      
      We already have for_each_node(node) defined in nodemask.h; better to use it.
      Signed-off-by: Bob Liu <lliubbo@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3ed28fa1
    • KAMEZAWA Hiroyuki's avatar
      memcg: simplify LRU handling by new rule · 38c5d72f
      KAMEZAWA Hiroyuki authored
      
      
      Now, at LRU handling, memory cgroup needs to do complicated work to see
      a valid pc->mem_cgroup, which may be overwritten.
      
      This patch relaxes the protocol.  It guarantees
         - when pc->mem_cgroup is overwritten, page must not be on LRU.
      
      With this, LRU routines can trust pc->mem_cgroup and don't need to check
      bits in pc->flags.  This new rule may add a small overhead to swapin.  But
      in most cases, lru handling gets faster.
      
      After this patch, PCG_ACCT_LRU bit is obsolete and removed.
      
      [akpm@linux-foundation.org: remove unneeded VM_BUG_ON(), restore hannes's christmas tree]
      [akpm@linux-foundation.org: clean up code comment]
      [hughd@google.com: fix NULL mem_cgroup_try_charge]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Miklos Szeredi <mszeredi@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      38c5d72f
    • KAMEZAWA Hiroyuki's avatar
      memcg: clear pc->mem_cgroup if necessary. · 4e5f01c2
      KAMEZAWA Hiroyuki authored
      
      
      This is a preparation before removing a flag PCG_ACCT_LRU in page_cgroup
      and reducing atomic ops/complexity in memcg LRU handling.
      
      In some cases, pages are added to lru before being charged to memcg, so
      they are not classified to a memory cgroup at lru addition.  Now, the lru
      where the page should be added is determined by a bit in
      page_cgroup->flags and pc->mem_cgroup.  I'd like to remove the check of
      the flag.
      
      pc->mem_cgroup may contain stale pointers if pages are added to LRU
      before classification.  To handle this case, this patch resets
      pc->mem_cgroup to root_mem_cgroup before lru additions.
      
      [akpm@linux-foundation.org: fix CONFIG_CGROUP_MEM_CONT=n build]
      [hughd@google.com: fix CONFIG_CGROUP_MEM_RES_CTLR=y CONFIG_CGROUP_MEM_RES_CTLR_SWAP=n build]
      [akpm@linux-foundation.org: ksm.c needs memcontrol.h, per Michal]
      [hughd@google.com: stop oops in mem_cgroup_reset_owner()]
      [hughd@google.com: fix page migration to reset_owner]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Miklos Szeredi <mszeredi@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4e5f01c2
    • KAMEZAWA Hiroyuki's avatar
      memcg: simplify corner case handling of LRU. · 36b62ad5
      KAMEZAWA Hiroyuki authored
      
      
      This patch simplifies LRU handling of the racy case (memcg+SwapCache).
      At charging, SwapCache pages tend to be on LRU already.  So, before
      overwriting pc->mem_cgroup, the page must be removed from LRU and added
      to LRU later.
      
      This patch does
              spin_lock(zone->lru_lock);
              if (PageLRU(page))
                      remove from LRU
              overwrite pc->mem_cgroup
              if (PageLRU(page))
                      add to new LRU.
              spin_unlock(zone->lru_lock);
      
      This guarantees that no page is on LRU while pc->mem_cgroup is being
      modified.  This patch also unifies lru handling of replace_page_cache()
      and swapin.
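      
      The sequence above can be modelled in userspace as a minimal sketch.
      Everything here is a hypothetical stand-in: struct page and struct
      mem_cgroup are not the kernel's types, the owner field models
      pc->mem_cgroup, a plain flag models zone->lru_lock, and
      recharge_page_lrucare() is an illustrative name.  The sketch keeps a
      local was_on_lru flag because the LRU state is cleared when the page
      is unlinked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these types are the kernel's. */
struct mem_cgroup {
	int nr_lru_pages;	/* pages currently linked on this group's LRU */
};

struct page {
	bool on_lru;			/* models PageLRU(page) */
	struct mem_cgroup *owner;	/* models pc->mem_cgroup */
};

static bool lru_locked;	/* single-threaded model of zone->lru_lock */

static void lru_lock(void)
{
	assert(!lru_locked);
	lru_locked = true;
}

static void lru_unlock(void)
{
	assert(lru_locked);
	lru_locked = false;
}

/* Change the owner of a page that may already sit on an LRU. */
static void recharge_page_lrucare(struct page *page, struct mem_cgroup *new_owner)
{
	bool was_on_lru;

	lru_lock();
	was_on_lru = page->on_lru;
	if (was_on_lru) {
		/* Unlink from the LRU of the old owner first. */
		page->owner->nr_lru_pages--;
		page->on_lru = false;
	}
	/* The page is on no LRU here, so the owner may be overwritten. */
	assert(!page->on_lru);
	page->owner = new_owner;
	if (was_on_lru) {
		/* Relink on the LRU of the new owner. */
		new_owner->nr_lru_pages++;
		page->on_lru = true;
	}
	lru_unlock();
}
```

      A page that was on the old owner's LRU ends up on the new owner's
      LRU, and a page that was on no LRU only has its owner changed.
      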
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Miklos Szeredi <mszeredi@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Ying Han <yinghan@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      36b62ad5
    • KAMEZAWA Hiroyuki's avatar
      memcg: simplify page cache charging · dc67d504
      KAMEZAWA Hiroyuki authored
      
      
      This patch is a clean up. No functional/logical changes.
      
      Because of commit ef6a3c63 ("mm: add replace_page_cache_page()
      function"), FUSE uses replace_page_cache() instead of
      add_to_page_cache().  Then, mem_cgroup_cache_charge() is not called
      against FUSE's pages from splice.
      
      So now, mem_cgroup_cache_charge() gets pages that are not on the LRU
      with the exception of PageSwapCache pages.  For checking,
      WARN_ON_ONCE(PageLRU(page)) is added.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Miklos Szeredi <mszeredi@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Ying Han <yinghan@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dc67d504
    • David Rientjes's avatar
      oom, memcg: fix exclusion of memcg threads after they have detached their mm · de077d22
      David Rientjes authored
      
      
      The oom killer relies on logic that identifies threads that have already
      been oom killed when scanning the tasklist and, if found, deferring
      until such threads have exited.  This is done by checking for any
      candidate threads that have the TIF_MEMDIE bit set.
      
      For memcg ooms, candidate threads are first found by calling
      task_in_mem_cgroup() since the oom killer should not defer if there's an
      oom killed thread in another memcg.
      
      Unfortunately, task_in_mem_cgroup() excludes threads if they have
      detached their mm in the process of exiting so TIF_MEMDIE is never
      detected for such conditions.  This is different for global, mempolicy,
      and cpuset oom conditions where a detached mm is only excluded after
      checking for TIF_MEMDIE and deferring, if necessary, in
      select_bad_process().
      
      The fix is to return true if a task has a detached mm but is still in
      the memcg or its hierarchy that is currently oom.  This will allow the
      oom killer to appropriately defer rather than kill unnecessarily or, in
      the worst case, panic the machine if nothing else is available to kill.
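      
      The membership check described above can be sketched in userspace as
      follows.  This is a simplified model under stated assumptions, not
      the kernel's task_in_mem_cgroup(): struct memcg, struct mm and struct
      task are hypothetical, hierarchy is modelled as a plain parent
      pointer that is always honoured, and locking is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified models of a memcg, an mm and a task. */
struct memcg {
	struct memcg *parent;	/* NULL for a top-level group */
};

struct mm {
	struct memcg *owner_memcg;	/* group the mm is charged to */
};

struct task {
	struct mm *mm;		/* NULL once an exiting task detached its mm */
	struct memcg *memcg;	/* group the task itself still belongs to */
};

/* True if child is root or lies below it in the hierarchy. */
static bool memcg_in_hierarchy(const struct memcg *child, const struct memcg *root)
{
	for (; child; child = child->parent)
		if (child == root)
			return true;
	return false;
}

/*
 * Decide whether a task belongs to the memcg that is currently oom.
 * A task with a detached mm is no longer excluded: the check falls
 * back to the group the task itself is attached to.
 */
static bool task_in_memcg(const struct task *t, const struct memcg *oom_memcg)
{
	const struct memcg *cur;

	assert(t != NULL && oom_memcg != NULL);
	cur = t->mm ? t->mm->owner_memcg : t->memcg;
	return memcg_in_hierarchy(cur, oom_memcg);
}
```

      In this sketch an exiting task with mm == NULL still counts as a
      member, so a caller scanning for already-killed threads would see it
      and could defer.
      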
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      de077d22