1. 09 Sep, 2017 34 commits
  2. 07 Sep, 2017 6 commits
    • mm,fork: introduce MADV_WIPEONFORK · d2cd9ede
      Rik van Riel authored
      Introduce MADV_WIPEONFORK semantics, which result in a VMA being empty
      in the child process after fork.  This differs from MADV_DONTFORK in one
      important way.
      If a child process accesses memory that was MADV_WIPEONFORK, it will get
      zeroes.  The address ranges are still valid, they are just empty.
      If a child process accesses memory that was MADV_DONTFORK, it will get a
      segmentation fault, since those address ranges are no longer valid in
      the child after fork.
      Since MADV_DONTFORK also seems to be used to allow very large programs
      to fork in systems with strict memory overcommit restrictions, changing
      the semantics of MADV_DONTFORK might break existing programs.
      MADV_WIPEONFORK only works on private, anonymous VMAs.
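      The semantics described above can be observed from user space with a
      small test.  The following is a minimal sketch, not part of the patch:
      the helper name child_sees_zeroes_after_wipeonfork() is hypothetical, and
      the fallback #define values are an assumption taken from the generic
      uapi header (some architectures, such as parisc, use other numbers).

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE
#endif
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Assumption: values from the generic Linux uapi header, used only when
 * the installed libc headers do not provide them yet. */
#ifndef MADV_WIPEONFORK
#define MADV_WIPEONFORK 18
#endif
#ifndef MADV_KEEPONFORK
#define MADV_KEEPONFORK 19
#endif

/*
 * Hypothetical helper.  Returns 1 if the child read only zeroes from the
 * range, 0 if it saw any of the parent's data, and -1 if a call failed
 * (for example because the kernel does not know MADV_WIPEONFORK).
 */
int child_sees_zeroes_after_wipeonfork(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    /* Stand-in for cached library state, e.g. a PRNG seed. */
    memset(p, 0xAA, len);

    if (madvise(p, len, MADV_WIPEONFORK) != 0) {
        munmap(p, len);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        munmap(p, len);
        return -1;
    }
    if (pid == 0) {
        /* Child: the range is still mapped, so this must not fault. */
        for (size_t i = 0; i < len; i++)
            if (p[i] != 0)
                _exit(1);
        _exit(0);
    }

    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status)) {
        munmap(p, len);
        return -1;
    }

    /* The parent's copy is not affected by the fork. */
    assert(p[0] == 0xAA);
    munmap(p, len);
    return WEXITSTATUS(status) == 0 ? 1 : 0;
}
```

      A library could use the same idea with a single flag byte in such a
      mapping: if the flag it set earlier reads back as zero, it is running
      in a forked child and should regenerate its state.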
      The use case is libraries that store or cache information, and want to
      know that they need to regenerate it in the child process after fork.
      Examples of this would be:
       - systemd/pulseaudio API checks (fail after fork) (replacing a getpid
         check, which is too slow without a PID cache)
       - PKCS#11 API reinitialization check (mandated by specification)
       - glibc's upcoming PRNG (reseed after fork)
       - OpenSSL PRNG (reseed after fork)
      The security benefits of a forking server having a re-initialized PRNG in
      every child process are pretty obvious.  However, due to libraries
      having all kinds of internal state, and programs getting compiled with
      many different versions of each library, it is unreasonable to expect
      calling programs to re-initialize everything manually after fork.
      A further complication is the proliferation of clone flags, programs
      bypassing glibc's functions to call clone directly, and programs calling
      unshare, causing the glibc pthread_atfork hook to not get called.
      It would be better to have the kernel take care of this automatically.
      The patch also adds MADV_KEEPONFORK, to undo the effects of a prior
      MADV_WIPEONFORK.
      This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:
      [akpm@linux-foundation.org: numerically order arch/parisc/include/uapi/asm/mman.h #defines]
      Link: http://lkml.kernel.org/r/20170811212829.29186-3-riel@redhat.com
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Reported-by: Florian Weimer <fweimer@redhat.com>
      Reported-by: Colm MacCártaigh <colm@allcosts.net>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Drewry <wad@chromium.org>
      Cc: <linux-api@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: hugetlb: clear target sub-page last when clearing huge page · c79b57e4
      Huang Ying authored
      Huge pages help to reduce the TLB miss rate, but they have a higher cache
      footprint, which can sometimes cause problems.  For example, when
      clearing a huge page on x86_64, the cache footprint is 2M.  But
      a Xeon E5 v3 2699 CPU has 18 cores, 36 threads, and only 45M of
      LLC (last level cache).  That is, on average, there is 2.5M of LLC for
      each core and 1.25M of LLC for each thread.
      If the cache pressure is heavy when clearing the huge page, and we clear
      the huge page from beginning to end, the beginning of the huge page may
      already be evicted from the cache by the time we finish clearing the
      end of the huge page.  And the application may well access the
      beginning of the huge page right after it has been cleared.
      To help with this situation, this patch changes the order in which the
      sub-pages of a huge page are cleared.  In many situations, we
      can get the address that the application will access after we clear the
      huge page, for example, in a page fault handler.  Instead of clearing
      the huge page from beginning to end, we clear the sub-pages farthest
      from the sub-page to be accessed first, and clear the sub-page to be
      accessed last.  This makes the sub-page to be accessed the most
      cache-hot, and the sub-pages around it more cache-hot too.  If we cannot
      know the address the application will access, the beginning of the huge
      page is assumed to be the address the application will access.
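      The ordering described above can be illustrated with a small user space
      model.  This is a simplified sketch, not the kernel's clear_huge_page()
      implementation: subpage_clear_order() is a hypothetical helper that only
      computes an order in which the farthest sub-pages come first and the
      target sub-page comes last.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: fill order[0..nr-1] with sub-page indices so that
 * sub-pages far from 'target' are visited first and 'target' is visited
 * last.  The window [left, right] always contains the target and shrinks
 * from whichever end is currently farther away from it.
 */
void subpage_clear_order(size_t nr, size_t target, size_t *order)
{
    assert(nr > 0 && target < nr);

    size_t left = 0, right = nr - 1, n = 0;

    while (left < right) {
        if (target - left >= right - target)
            order[n++] = left++;    /* left end is at least as far away */
        else
            order[n++] = right--;   /* right end is farther away */
    }
    order[n] = left;                /* left == right == target here */
}
```

      With 8 sub-pages and target 2, this yields the order 7, 6, 5, 0, 4, 1,
      3, 2.  With target 0, which models the case where the access address is
      unknown, the sub-pages are simply visited from the end to the beginning.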
      With this patch, the throughput increases ~28.3% in the vm-scalability
      anon-w-seq test case with 72 processes on a 2 socket Xeon E5 v3 2699
      system (36 cores, 72 threads).  The test case creates 72 processes; each
      process mmaps a big anonymous memory area and writes to it from
      beginning to end.  For each process, the other processes can be seen as
      a workload that generates heavy cache pressure.  At the same time, the
      cache miss rate is reduced from ~33.4% to ~31.7%, the IPC (instructions
      per cycle) increases from 0.56 to 0.74, and the time spent in user space
      is reduced by ~7.9%.
      Christopher Lameter suggested clearing the bytes inside a sub-page from
      end to beginning too.  But tests show no visible performance difference.
      Maybe this is because the size of a page is small compared with the cache
      size.
      Thanks to Andi Kleen for proposing to use the address to be accessed to
      determine the order of sub-pages to clear.
      The hugetlbfs access address could be improved; that will be done in
      another patch.
      [ying.huang@intel.com: improve readability of clear_huge_page()]
        Link: http://lkml.kernel.org/r/20170830051842.1397-1-ying.huang@intel.com
      Link: http://lkml.kernel.org/r/20170815014618.15842-1-ying.huang@intel.com
      Suggested-by: Andi Kleen <andi.kleen@intel.com>
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Michal Hocko <mhocko@suse.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Nadia Yvette Chambers <nyc@holomorphy.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: oom: let oom_reap_task and exit_mmap run concurrently · 21292580
      Andrea Arcangeli authored
      This is purely required because exit_aio() may block and exit_mmap() may
      never start, if the oom_reap_task cannot start running on a mm with
      mm_users == 0.
      At the same time if the OOM reaper doesn't wait at all for the memory of
      the current OOM candidate to be freed by exit_mmap->unmap_vmas, it would
      generate a spurious OOM kill.
      If it weren't for exit_aio or similar blocking functions in
      the last mmput, it would be enough to change oom_reap_task(), in the
      case where it finds mm_users == 0, to wait for a timeout or to wait for
      __mmput to set MMF_OOM_SKIP itself.  But exit_mmap is not the only
      problem here, so the concurrency of exit_mmap and oom_reap_task is
      apparently warranted.
      It's a non-standard runtime: exit_mmap() runs without mmap_sem, and
      oom_reap_task runs with the mmap_sem for reading as usual (kind of
      The race between the two is solved with a combination of
      tsk_is_oom_victim() (serialized by task_lock) and MMF_OOM_SKIP
      (serialized by a dummy down_write/up_write cycle on the same lines of
      the ksm_exit method).
      If oom_reap_task() may be running concurrently during exit_mmap,
      exit_mmap will wait for it to finish in down_write (before taking down mm
      structures, which would make oom_reap_task fail with a use after free).
      If exit_mmap comes first, oom_reap_task() will skip the mm if
      MMF_OOM_SKIP is already set and in turn all memory is already freed and
      furthermore the mm data structures may already have been taken down by
      [aarcange@redhat.com: incremental one liner]
        Link: http://lkml.kernel.org/r/20170726164319.GC29716@redhat.com
      [rientjes@google.com: remove unused mmput_async]
        Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1708141733130.50317@chino.kir.corp.google.com
      [aarcange@redhat.com: microoptimization]
        Link: http://lkml.kernel.org/r/20170817171240.GB5066@redhat.com
      Link: http://lkml.kernel.org/r/20170726162912.GA29716@redhat.com
      Fixes: 26db62f1 ("oom: keep mm of the killed task available")
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reported-by: David Rientjes <rientjes@google.com>
      Tested-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Michal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • swap: choose swap device according to numa node · a2468cc9
      Aaron Lu authored
      If the system has more than one swap device and the swap devices have
      node information, we can make use of this information to decide which
      swap device to use in get_swap_pages() to get better performance.
      The current code uses a priority based list, swap_avail_list, to decide
      which swap device to use and if multiple swap devices share the same
      priority, they are used round robin.  This patch changes the previous
      single global swap_avail_list into a per-numa-node list, i.e. each
      numa node sees its own priority based list of available swap
      devices.  A swap device's priority can be promoted on its matching node's
      swap_avail_list.
      Currently, a swap device's priority is set as follows: the user can set
      a >=0 value, or the system will pick one starting from -1 and going
      downwards.  The priority value in the swap_avail_list is the negated
      value of the swap device's priority, because plist is sorted from low to
      high.  The new policy doesn't change the semantics for the priority >=0
      cases; what was previously "starting from -1 and going downwards" now
      becomes "starting from -2 and going downwards", and -1 is reserved as
      the promoted value.
      Take a 4-node EX machine as an example, and suppose 4 swap devices are
      available, each sitting on a different node:
      swapA on node 0
      swapB on node 1
      swapC on node 2
      swapD on node 3
      Suppose they are all swapped on in the sequence ABCD.
      Current behaviour:
      their priorities will be:
      swapA: -1
      swapB: -2
      swapC: -3
      swapD: -4
      And their position in the global swap_avail_list will be:
      swapA   -> swapB   -> swapC   -> swapD
      prio:1     prio:2     prio:3     prio:4
      New behaviour:
      their priorities will be (note that -1 is skipped):
      swapA: -2
      swapB: -3
      swapC: -4
      swapD: -5
      And their positions in the 4 swap_avail_lists[nid] will be:
      swap_avail_lists[0]: /* node 0's available swap device list */
      swapA   -> swapB   -> swapC   -> swapD
      prio:1     prio:3     prio:4     prio:5
      swap_avail_lists[1]: /* node 1's available swap device list */
      swapB   -> swapA   -> swapC   -> swapD
      prio:1     prio:2     prio:4     prio:5
      swap_avail_lists[2]: /* node 2's available swap device list */
      swapC   -> swapA   -> swapB   -> swapD
      prio:1     prio:2     prio:3     prio:5
      swap_avail_lists[3]: /* node 3's available swap device list */
      swapD   -> swapA   -> swapB   -> swapC
      prio:1     prio:2     prio:3     prio:4
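      The per-node ordering in the example above can be reproduced with a
      small model.  This is only a sketch under simplifying assumptions: the
      struct swap_dev type and the helpers avail_key() and build_avail_list()
      are hypothetical, and a plain insertion sort stands in for the kernel's
      plist.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-in for a swap device. */
struct swap_dev {
    char name;  /* 'A', 'B', ... */
    int prio;   /* >= 0 if set by the user, else auto-assigned: -2, -3, ... */
    int node;   /* numa node the device is attached to */
};

/*
 * List key of a device as seen from node 'nid'.  The key is the negated
 * priority because the list is sorted from low to high.  A device with an
 * auto-assigned (negative) priority is promoted to -1 on its own node.
 */
int avail_key(const struct swap_dev *d, int nid)
{
    int prio = d->prio;

    if (prio < 0 && d->node == nid)
        prio = -1;
    return -prio;
}

/* Build node nid's list: stable insertion sort, lowest key first. */
void build_avail_list(const struct swap_dev *devs, size_t n, int nid,
                      const struct swap_dev **list)
{
    for (size_t i = 0; i < n; i++) {
        size_t j = i;

        while (j > 0 &&
               avail_key(list[j - 1], nid) > avail_key(&devs[i], nid)) {
            list[j] = list[j - 1];
            j--;
        }
        list[j] = &devs[i];
    }
}
```

      Feeding in swapA..swapD with priorities -2..-5 on nodes 0..3 gives, for
      node 1, the order swapB, swapA, swapC, swapD with keys 1, 2, 4, 5, which
      matches the list shown for swap_avail_lists[1].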
      To see the effect of the patch, a test is used that starts N processes,
      each of which mmaps a region of anonymous memory and then continually
      writes to it at random positions to trigger both swap in and swap out.
      On a 2 node Skylake EP machine with 64GiB memory, two 170GB SSD drives
      are used as swap devices, each attached to a different node.  The
      result is:
      runtime=30m/processes=32/total test size=128G/each process mmap region=4G
      kernel         throughput
      vanilla        13306
      auto-binding   15169 +14%
      runtime=30m/processes=64/total test size=128G/each process mmap region=2G
      kernel         throughput
      vanilla        11885
      auto-binding   14879 +25%
      [aaron.lu@intel.com: v2]
        Link: http://lkml.kernel.org/r/20170814053130.GD2369@aaronlu.sh.intel.com
        Link: http://lkml.kernel.org/r/20170816024439.GA10925@aaronlu.sh.intel.com
      [akpm@linux-foundation.org: use kmalloc_array()]
      Link: http://lkml.kernel.org/r/20170814053130.GD2369@aaronlu.sh.intel.com
      Link: http://lkml.kernel.org/r/20170816024439.GA10925@aaronlu.sh.intel.com
      Signed-off-by: Aaron Lu <aaron.lu@intel.com>
      Cc: "Chen, Tim C" <tim.c.chen@intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: replace TIF_MEMDIE checks by tsk_is_oom_victim · da99ecf1
      Michal Hocko authored
      TIF_MEMDIE is set only on the tasks which were either directly selected
      by the OOM killer or passed through mark_oom_victim from the allocator
      path.  tsk_is_oom_victim is more generic and allows identifying all
      tasks (threads) which share the mm with the oom victim.
      Please note that the freezer still needs to check TIF_MEMDIE because we
      cannot thaw tasks which do not participate in oom_victims counting;
      otherwise a !TIF_MEMDIE task could interfere after oom_disable returns.
      Link: http://lkml.kernel.org/r/20170810075019.28998-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, oom: do not rely on TIF_MEMDIE for memory reserves access · cd04ae1e
      Michal Hocko authored
      For ages we have been relying on the TIF_MEMDIE thread flag to mark OOM
      victims and then, among other things, to give these threads full access
      to memory reserves.  There are a few shortcomings of this implementation.
      First of all, and most seriously, full access to memory
      reserves is quite dangerous because we leave no safety room for the
      system to operate and potentially do last emergency steps to move on.
      Secondly this flag is per task_struct while the OOM killer operates on
      mm_struct granularity so all processes sharing the given mm are killed.
      Giving the full access to all these task_structs could lead to a quick
      memory reserves depletion.  We have tried to reduce this risk by giving
      TIF_MEMDIE only to the main thread and the currently allocating task, but
      that doesn't really solve this problem while it surely opens up room
      for corner cases - e.g.  GFP_NO{FS,IO} requests might loop inside the
      allocator without access to memory reserves because a particular thread
      was not the group leader.
      Now that we have the oom reaper and that all oom victims are reapable
      after 1b51e65e ("oom, oom_reaper: allow to reap mm shared by the
      kthreads") we can be more conservative and grant only partial access to
      memory reserves because there are reasonable chances of the parallel
      memory freeing.  We still want some access to reserves because we do not
      want other consumers to eat up the victim's freed memory.  oom victims
      will still contend with __GFP_HIGH users but those shouldn't be so
      aggressive as to starve oom victims completely.
      Introduce the ALLOC_OOM flag and give all tsk_is_oom_victim tasks access
      to half of the reserves.  This makes the access to reserves independent
      of which task has passed through mark_oom_victim.  Also drop any usage
      of TIF_MEMDIE from the page allocator proper and replace it with
      tsk_is_oom_victim as well, which finally makes page_alloc.c completely
      TIF_MEMDIE free.
      CONFIG_MMU=n doesn't have oom reaper so let's stick to the original
      ALLOC_NO_WATERMARKS approach.
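      The policy described above can be summarized with a toy model.  This is
      a sketch only and not the page allocator code: the enum, the helper
      names and the idea of treating "the reserves" as the pages below a
      single minimum watermark are simplifying assumptions made for
      illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of how deep a task may dip into the reserves. */
enum reserve_access {
    RESERVES_NONE,  /* ordinary allocation, must stay above the watermark */
    RESERVES_HALF,  /* models ALLOC_OOM: half of the reserves */
    RESERVES_FULL   /* models ALLOC_NO_WATERMARKS: no limit */
};

/* OOM victims get partial access with an MMU (where the oom reaper can
 * free memory in parallel) and full access without one. */
enum reserve_access choose_access(bool is_oom_victim, bool has_mmu)
{
    if (!is_oom_victim)
        return RESERVES_NONE;
    return has_mmu ? RESERVES_HALF : RESERVES_FULL;
}

/* Lowest number of free pages the allocation may leave behind. */
long lowest_free_pages_allowed(long min_wmark, enum reserve_access access)
{
    switch (access) {
    case RESERVES_FULL:
        return 0;
    case RESERVES_HALF:
        return min_wmark - min_wmark / 2;
    default:
        return min_wmark;
    }
}
```

      In this model, with a minimum watermark of 1000 pages, an OOM victim on
      an MMU system may allocate until 500 free pages remain, while ordinary
      allocations stop at 1000.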
      There is a demand to make the oom killer memcg aware which will imply
      many tasks killed at once.  This change will allow such a usecase
      without worrying about complete memory reserves depletion.
      Link: http://lkml.kernel.org/r/20170810075019.28998-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>