1. 09 Sep, 2017 40 commits
    • mm: fadvise: avoid fadvise for fs without backing device · 3a77d214
      Shakeel Butt authored
      The fadvise() manpage is silent on fadvise()'s effect on memory-based
      filesystems (shmem, hugetlbfs & ramfs) and pseudo file systems (procfs,
      sysfs, kernfs).  The current implementation of fadvise is mostly a noop
      for such filesystems except for FADV_DONTNEED which will trigger
      expensive remote LRU cache draining.  This patch makes the noop of
      fadvise() on such file systems very explicit.
      
      However this change has two side effects for ramfs and one for tmpfs.
      First fadvise(FADV_DONTNEED) could remove the unmapped clean zero'ed
      pages of ramfs (allocated through read, readahead & read fault) and
      tmpfs (allocated through read fault).  Also fadvise(FADV_WILLNEED) could
      create such clean zero'ed pages for ramfs.  This change removes those
      possibilities.
      
      One of our generic libraries does fadvise(FADV_DONTNEED).  Recently we
      observed high latency in fadvise() and noticed that the users have
      started using tmpfs files and the latency was due to expensive remote
      LRU cache draining.  For normal tmpfs files (have data written on them),
      fadvise(FADV_DONTNEED) will always trigger the unneeded remote cache
      draining.
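
      As a rough illustration of the caller side described above, the
      following Python sketch issues the same hint from userspace and times it.
      The helper names (timed_dontneed, demo) and the temporary-file setup are
      hypothetical and not part of the patch; the measured time depends on the
      filesystem backing the file and on the kernel version.

```python
import os
import tempfile
import time


def timed_dontneed(fd):
    """Call posix_fadvise(POSIX_FADV_DONTNEED) on fd and return elapsed seconds.

    Returns None where os.posix_fadvise is unavailable (non-Unix platforms).
    """
    if not hasattr(os, "posix_fadvise"):
        return None
    start = time.perf_counter()
    # offset=0, len=0 means "from the start to the end of the file".
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    return time.perf_counter() - start


def demo(directory=None):
    """Write one page of data to a temp file in `directory` and time the hint."""
    with tempfile.TemporaryFile(dir=directory) as f:
        f.write(b"x" * 4096)
        f.flush()
        return timed_dontneed(f.fileno())


if __name__ == "__main__":
    # Pass a tmpfs mount point (often /dev/shm, which is an assumption about
    # the system) to compare against a disk-backed directory.
    print(demo())
```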
      
      Link: http://lkml.kernel.org/r/20170818011023.181465-1-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/zsmalloc.c: change stat type parameter to int · 3eb95fea
      Matthias Kaehlcke authored
      zs_stat_inc/dec/get() use enum zs_stat_type for the stat type; however,
      some callers pass an enum fullness_group value.  Change the type to int to
      reflect the actual use of the functions and get rid of 'enum-conversion'
      warnings.
      
      Link: http://lkml.kernel.org/r/20170731175000.56538-1-mka@chromium.org
      Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Doug Anderson <dianders@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mlock.c: use page_zone() instead of page_zone_id() · 9472f23c
      Joonsoo Kim authored
      page_zone_id() is a specialized function to compare the zone for the pages
      that are within the section range.  If the sections of the pages are
      different, page_zone_id() can be different even if their zone is the same.
      This wrong usage doesn't cause any actual problem since
      __munlock_pagevec_fill() would be called again with the failed index.
      However, it's better to use the more appropriate function here.
      
      Link: http://lkml.kernel.org/r/1503559211-10259-1-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: consider the number in local CPUs when reading NUMA stats · 63803222
      Kemi Wang authored
      To avoid deviation, the per-CPU number of NUMA stats in
      vm_numa_stat_diff[] is included when a user *reads* the NUMA stats.

      Since NUMA stats are not read by users frequently, and the kernel does
      not need them to make decisions, it is not a problem to make the readers
      more expensive.
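
      The read path described above can be pictured with a toy model: writers
      only bump a per-CPU delta, and an accurate read has to add every pending
      delta to the global value.  This is a simplified sketch, not the kernel
      code; the class and method names are hypothetical, and cpu_diff merely
      stands in for vm_numa_stat_diff[].

```python
class NumaStat:
    """Toy model: one global count plus per-CPU deltas not yet folded in."""

    def __init__(self, nr_cpus):
        self.global_count = 0
        # Stands in for the per-CPU vm_numa_stat_diff[] entries.
        self.cpu_diff = [0] * nr_cpus

    def inc(self, cpu):
        # Writers only touch their own CPU's delta (cheap, no sharing).
        self.cpu_diff[cpu] += 1

    def read_global_only(self):
        # A cheap read that misses whatever is still pending per CPU.
        return self.global_count

    def read(self):
        # The more expensive but accurate read: global plus all per-CPU deltas.
        return self.global_count + sum(self.cpu_diff)


stat = NumaStat(nr_cpus=4)
for cpu in (0, 1, 1, 3):
    stat.inc(cpu)
print(stat.read_global_only(), stat.read())
```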
      
      Link: http://lkml.kernel.org/r/1503568801-21305-4-git-send-email-kemi.wang@intel.com
      Signed-off-by: Kemi Wang <kemi.wang@intel.com>
      Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Ying Huang <ying.huang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: update NUMA counter threshold size · 1d90ca89
      Kemi Wang authored
      There is significant overhead in cache bouncing caused by zone counters
      (NUMA-associated counters) being updated in parallel during multi-threaded
      page allocation (suggested by Dave Hansen).

      This patch updates the NUMA counter threshold to a fixed size of
      MAX_U16 - 2, as a small threshold greatly increases the update frequency
      of the global counter from the local per-CPU counters (suggested by Ying
      Huang).

      The rationale is that these statistics counters don't affect the
      kernel's decisions, unlike other VM counters, so it's not a problem to use
      a large threshold.
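
      To see why a larger threshold reduces traffic on the shared counter,
      consider this toy model of a single CPU that folds its local delta into
      the global counter whenever the delta exceeds the threshold.  It is a
      sketch under simplifying assumptions (one CPU, increments only); the
      function name is hypothetical and the flush rule is not copied from the
      kernel source.

```python
def count_global_updates(nr_events, threshold):
    """Count how often one CPU folds its local delta into the global counter.

    Toy model: the local delta is flushed once it exceeds `threshold`.
    Returns (number of flushes, total events accounted for).
    """
    local = 0
    global_count = 0
    flushes = 0
    for _ in range(nr_events):
        local += 1
        if local > threshold:
            global_count += local  # the shared cache line is written here
            local = 0
            flushes += 1
    return flushes, global_count + local


for threshold in (32, 125, 65533):
    flushes, total = count_global_updates(1_000_000, threshold)
    print(threshold, flushes, total)
```

      No event is lost in this model; the pending local delta simply has to be
      added back when the counter is read.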
      
      With this patchset, we see a 31.3% drop in CPU cycles (537-->369) per
      single page allocation and reclaim on Jesper's page_bench03 benchmark.

      Benchmark provided by Jesper D Brouer (loop count increased to 10000000):
      https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench
      
       Threshold   CPU cycles    Throughput(88 threads)
           32          799         241760478
           64          640         301628829
           125         537         358906028 <==> system by default (base)
           256         468         412397590
           512         428         450550704
           4096        399         482520943
           20000       394         489009617
           30000       395         488017817
           65533       369(-31.3%) 521661345(+45.3%) <==> with this patchset
           N/A         342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics
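
      The percentages in the table are relative to the default threshold of
      125.  A short check of that arithmetic, using only the numbers quoted
      above (the helper name pct_change is illustrative):

```python
BASE_CYCLES, BASE_THROUGHPUT = 537, 358906028  # threshold 125 (default)

rows = {
    "65533": (369, 521661345),
    "zone_statistics disabled": (342, 562900157),
}


def pct_change(new, base):
    """Relative change of `new` versus `base`, in percent, one decimal."""
    return round((new - base) / base * 100, 1)


for label, (cycles, throughput) in rows.items():
    print(label, pct_change(cycles, BASE_CYCLES),
          pct_change(throughput, BASE_THROUGHPUT))
```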
      
      Link: http://lkml.kernel.org/r/1503568801-21305-3-git-send-email-kemi.wang@intel.com
      Signed-off-by: Kemi Wang <kemi.wang@intel.com>
      Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Suggested-by: Dave Hansen <dave.hansen@intel.com>
      Suggested-by: Ying Huang <ying.huang@intel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: change the call sites of numa statistics items · 3a321d2a
      Kemi Wang authored
      Patch series "Separate NUMA statistics from zone statistics", v2.
      
      Each page allocation updates a set of per-zone statistics with a call to
      zone_statistics().  As discussed at the 2017 MM summit, these are a
      substantial source of overhead in the page allocator and are very rarely
      consumed.  The significant overhead comes from cache bouncing caused by
      zone counters (NUMA-associated counters) being updated in parallel during
      multi-threaded page allocation (pointed out by Dave Hansen).
      
      A link to the MM summit slides:
        http://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit2017-JesperBrouer.pdf
      
      To mitigate this overhead, this patchset separates NUMA statistics from
      the zone statistics framework, and updates the NUMA counter threshold to
      a fixed size of MAX_U16 - 2, as a small threshold greatly increases the
      update frequency of the global counter from the local per-CPU counters
      (suggested by Ying Huang).  The rationale is that these statistics
      counters don't need to be read often, unlike other VM counters, so it's
      not a problem to use a large threshold and make readers more expensive.
      
      With this patchset, we see a 31.3% drop in CPU cycles (537-->369, see
      below) per single page allocation and reclaim on Jesper's
      page_bench03 benchmark.  Meanwhile, this patchset keeps the same style
      of virtual memory statistics with little end-user-visible effect (it only
      moves the NUMA stats to appear after the zone page stats; see the first
      patch for details).

      I ran an experiment doing single page allocation and reclaim concurrently
      using Jesper's page_bench03 benchmark on a 2-socket Broadwell-based
      server (88 processors with 126G memory) with different threshold sizes
      for the pcp counter.

      Benchmark provided by Jesper D Brouer (loop count increased to 10000000):
        https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench
      
         Threshold   CPU cycles    Throughput(88 threads)
            32        799         241760478
            64        640         301628829
            125       537         358906028 <==> system by default
            256       468         412397590
            512       428         450550704
            4096      399         482520943
            20000     394         489009617
            30000     395         488017817
            65533     369(-31.3%) 521661345(+45.3%) <==> with this patchset
            N/A       342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics
      
      This patch (of 3):
      
      In this patch, NUMA statistics are separated from the zone statistics
      framework, and all the call sites of NUMA stats are changed to use
      numa-stats-specific functions.  There is no functional change except
      that the NUMA stats are shown after the zone page stats when users
      *read* the zone info.
      
      E.g. cat /proc/zoneinfo
          ***Base***                           ***With this patch***
      nr_free_pages 3976                         nr_free_pages 3976
      nr_zone_inactive_anon 0                    nr_zone_inactive_anon 0
      nr_zone_active_anon 0                      nr_zone_active_anon 0
      nr_zone_inactive_file 0                    nr_zone_inactive_file 0
      nr_zone_active_file 0                      nr_zone_active_file 0
      nr_zone_unevictable 0                      nr_zone_unevictable 0
      nr_zone_write_pending 0                    nr_zone_write_pending 0
      nr_mlock     0                             nr_mlock     0
      nr_page_table_pages 0                      nr_page_table_pages 0
      nr_kernel_stack 0                          nr_kernel_stack 0
      nr_bounce    0                             nr_bounce    0
      nr_zspages   0                             nr_zspages   0
      numa_hit 0                                *nr_free_cma  0*
      numa_miss 0                                numa_hit     0
      numa_foreign 0                             numa_miss    0
      numa_interleave 0                          numa_foreign 0
      numa_local   0                             numa_interleave 0
      numa_other   0                             numa_local   0
      *nr_free_cma 0*                            numa_other 0
          ...                                        ...
      vm stats threshold: 10                     vm stats threshold: 10
          ...                                        ...
      
      The next patch updates the numa stats counter size and threshold.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/1503568801-21305-2-git-send-email-kemi.wang@intel.com
      Signed-off-by: Kemi Wang <kemi.wang@intel.com>
      Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Ying Huang <ying.huang@intel.com>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory.c: remove reduntant check for write access · fde26bed
      Anshuman Khandual authored
      The flags argument has been copied into vmf.flags and is not changed in
      between.  Hence a single write access check can be used for both PUD and
      PMD.
      
      Link: http://lkml.kernel.org/r/20170823082839.1812-1-khandual@linux.vnet.ibm.com
      Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userfaultfd: non-cooperative: closing the uffd without triggering SIGBUS · 656710a6
      Andrea Arcangeli authored
      This is an enhancement to avoid a non-cooperative userfaultfd manager
      having to unregister all regions before it can close the uffd after all
      userfaultfd activity has completed.

      The UFFDIO_UNREGISTER would serialize against handle_userfault() by
      taking the mmap_sem for writing, but we can simply repeat the page fault
      if we detect the uffd was closed, so that the regular page fault paths
      take over.
      
      Link: http://lkml.kernel.org/r/20170823181227.19926-1-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove useless vma parameter to offset_il_node · 98c70baa
      Laurent Dufour authored
      While reading the code I found that offset_il_node() has a vm_area_struct
      pointer parameter which is unused.
      
      Link: http://lkml.kernel.org/r/1502899755-23146-1-git-send-email-ldufour@linux.vnet.ibm.com
      Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hmm: fix build when HMM is disabled · de540a97
      Jérôme Glisse authored
      Combinatorial Kconfig is painful.  With this patch, all of the
      combinations below build.
      
      1)
      
      2)
      CONFIG_HMM_MIRROR=y
      
      3)
      CONFIG_DEVICE_PRIVATE=y
      
      4)
      CONFIG_DEVICE_PUBLIC=y
      
      5)
      CONFIG_HMM_MIRROR=y
      CONFIG_DEVICE_PUBLIC=y
      
      6)
      CONFIG_HMM_MIRROR=y
      CONFIG_DEVICE_PRIVATE=y
      
      7)
      CONFIG_DEVICE_PRIVATE=y
      CONFIG_DEVICE_PUBLIC=y
      
      8)
      CONFIG_HMM_MIRROR=y
      CONFIG_DEVICE_PRIVATE=y
      CONFIG_DEVICE_PUBLIC=y
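
      The eight cases above are every on/off combination of three options,
      which suggests that case 1, printed with no options, is the configuration
      with none of them set (an inference, not stated in the message).  A small
      sketch that enumerates the same set, possibly in a different order; the
      function name all_configs is illustrative:

```python
from itertools import product

OPTIONS = ("CONFIG_HMM_MIRROR", "CONFIG_DEVICE_PRIVATE", "CONFIG_DEVICE_PUBLIC")


def all_configs(options=OPTIONS):
    """Return every on/off combination as a list of enabled-option lists."""
    configs = []
    for flags in product((False, True), repeat=len(options)):
        configs.append([f"{opt}=y" for opt, on in zip(options, flags) if on])
    # Sort by number of enabled options so the empty config comes first.
    return sorted(configs, key=len)


for i, cfg in enumerate(all_configs(), start=1):
    print(f"{i})", " ".join(cfg) or "(none set)")
```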
      
      Link: http://lkml.kernel.org/r/20170826002149.20919-1-jglisse@redhat.com
      Reported-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hmm: avoid bloating arch that do not make use of HMM · 6b368cd4
      Jérôme Glisse authored
      This moves all new code, including the new page migration helper, behind
      a kernel Kconfig option so that there is no code bloat for architectures
      or users that do not want to use HMM or any of its associated features.

      arm allyesconfig (without the patchset, then with the patchset and this patch):
         text	   data	    bss	    dec	    hex	filename
      83721896	46511131	27582964	157815991	96814b7	../without/vmlinux
      83722364	46511131	27582964	157816459	968168b	vmlinux
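
      Reading the two rows as "before" and "after" (an interpretation of the
      caption above), the growth can be checked directly from the quoted
      numbers; the variable names are illustrative:

```python
# Columns copied from the size output above: text, data, bss, dec.
without = {"text": 83721896, "data": 46511131, "bss": 27582964, "dec": 157815991}
with_patch = {"text": 83722364, "data": 46511131, "bss": 27582964, "dec": 157816459}

for row in (without, with_patch):
    # dec is the sum of the three sections.
    assert row["text"] + row["data"] + row["bss"] == row["dec"]

# Per-column difference in bytes between the two builds.
growth = {key: with_patch[key] - without[key] for key in without}
print(growth)
```

      Only the text column differs between the two rows; data and bss are
      identical.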
      
      [jglisse@redhat.com: struct hmm is only used by HMM mirror functionality]
        Link: http://lkml.kernel.org/r/20170825213133.27286-1-jglisse@redhat.com
      [sfr@canb.auug.org.au: fix build (arm multi_v7_defconfig)]
        Link: http://lkml.kernel.org/r/20170828181849.323ab81b@canb.auug.org.au
      Link: http://lkml.kernel.org/r/20170818032858.7447-1-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hmm: add new helper to hotplug CDM memory region · d3df0a42
      Jérôme Glisse authored
      Unlike unaddressable memory, coherent device memory has a real resource
      associated with it on the system (as the CPU can address it).  Add a new
      helper to hotplug such memory within the HMM framework.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-20-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: Balbir Singh <bsingharora@gmail.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/device-public-memory: device memory cache coherent with CPU · df6ad698
      Jérôme Glisse authored
      Platforms with an advanced system bus (like CAPI or CCIX) allow device
      memory to be accessible from the CPU in a cache-coherent fashion.  Add a
      new type of ZONE_DEVICE to represent such memory.  The use cases are the
      same as for un-addressable device memory, but without all the corner cases.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-19-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/migrate: allow migrate_vma() to alloc new page on empty entry · 8315ada7
      Jérôme Glisse authored
      This allows callers of migrate_vma() to allocate a new page for an empty
      CPU page table entry (pte_none or backed by the zero page).  This is only
      for anonymous memory, and it won't allow a new page to be instantiated if
      userfaultfd is armed.

      This is useful to device drivers that want to migrate a range of virtual
      addresses and would rather allocate new memory than have to fault later
      on.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-18-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/migrate: support un-addressable ZONE_DEVICE page in migration · a5430dda
      Jérôme Glisse authored
      Allow unmapping and restoring the special swap entries of un-addressable
      ZONE_DEVICE memory.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-17-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/migrate: migrate_vma() unmap page from vma while collecting pages · 8c3328f1
      Jérôme Glisse authored
      The common case for migration of a virtual address range is that pages
      are mapped only once, inside the vma in which migration is taking place.
      Because we already walk the CPU page table for that range, we can do the
      unmap directly there and set up the special migration swap entry.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-16-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
      Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/migrate: new memory migration helper for use with device memory · 8763cb45
      Jérôme Glisse authored
      This patch adds a new memory migration helper, which migrates memory
      backing a range of virtual addresses of a process to different memory
      (which can be allocated through a special allocator).  It differs from
      NUMA migration by working on a range of virtual addresses and thus by
      doing migration in chunks that can be large enough to use a DMA engine or
      a special copy offloading engine.

      Expected users are anyone with heterogeneous memory where different
      memory has different characteristics (latency, bandwidth, ...).  As an
      example, an IBM platform with a CAPI bus can make use of this feature to
      migrate between regular memory and CAPI device memory.  New CPU
      architectures with a pool of high-performance memory not managed as cache
      but presented as regular memory (while being faster and with lower latency
      than DDR) will also be prime users of this patch.

      Migration to private device memory will be useful for devices that have
      a large pool of such memory, like GPUs; NVidia plans to use HMM for that.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-15-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
      Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY · 2916ecc0
      Jérôme Glisse authored
      Introduce a new migration mode that allows offloading the copy to a device
      DMA engine.  This changes the workflow of migration, and not all
      address_space migratepage callbacks can support this.

      This is intended to be used by migrate_vma(), which itself is used for
      things like HMM (see include/linux/hmm.h).

      No additional per-filesystem migratepage testing is needed.  I disabled
      MIGRATE_SYNC_NO_COPY in all problematic migratepage() callbacks and
      added comments in those to explain why (part of this patch).  The commit
      message is unclear; it should say that any callback that wishes to support
      this new mode needs to be aware of the difference in the migration flow
      from other modes.

      Some of these callbacks do extra locking while copying (aio, zsmalloc,
      balloon, ...), and for DMA to be effective you want to copy multiple
      pages in one DMA operation.  But in the problematic cases you cannot
      easily hold the extra lock across multiple calls to this callback.
      
      Usual flow is:
      
      For each page {
       1 - lock page
       2 - call migratepage() callback
       3 - (extra locking in some migratepage() callback)
       4 - migrate page state (freeze refcount, update page cache, buffer
           head, ...)
       5 - copy page
       6 - (unlock any extra lock of migratepage() callback)
       7 - return from migratepage() callback
       8 - unlock page
      }
      
      The new mode MIGRATE_SYNC_NO_COPY:
       1 - lock multiple pages
      For each page {
       2 - call migratepage() callback
       3 - abort in all problematic migratepage() callback
       4 - migrate page state (freeze refcount, update page cache, buffer
           head, ...)
      } // finished all calls to migratepage() callback
       5 - DMA copy multiple pages
       6 - unlock all the pages
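
      The difference between the two flows above can be sketched as a toy
      model that only records the order of steps and counts copy operations.
      This is not kernel code: the function names, the step labels and the
      copy callback are hypothetical, and the optional extra locking of the
      usual flow is left out.

```python
def usual_flow(pages, copy):
    """Per-page flow: lock, migrate state, copy, unlock, one page at a time."""
    log = []
    for page in pages:
        log.append(("lock", page))
        log.append(("migrate_state", page))
        copy([page])                    # one copy operation per page
        log.append(("unlock", page))
    return log


def no_copy_flow(pages, copy):
    """Batched flow: lock all, migrate state per page, one copy, unlock all."""
    log = [("lock", page) for page in pages]
    for page in pages:
        log.append(("migrate_state", page))  # the callback must not copy here
    copy(list(pages))                   # single batched (DMA-style) copy
    log.extend(("unlock", page) for page in pages)
    return log


copies = []
usual_flow(range(4), copies.append)
per_page_copies = len(copies)

copies.clear()
no_copy_flow(range(4), copies.append)
batched_copies = len(copies)

print(per_page_copies, batched_copies)
```

      In the batched flow every page stays locked until the single copy has
      finished, which is why callbacks that take their own extra lock per call
      do not fit this mode.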
      
      To support MIGRATE_SYNC_NO_COPY in the problematic cases we would need a
      new callback migratepages() (for instance) that deals with multiple
      pages in one transaction.

      Because the problematic cases are not important for current usage, I did
      not want to complicate this patchset even more for no good reason.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-14-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hmm/devmem: dummy HMM device for ZONE_DEVICE memory · 858b54da
      Jérôme Glisse authored
      This introduces a dummy HMM device class so that device drivers can use
      it to create an hmm_device for the sole purpose of registering device
      memory.  It is useful to device drivers that want to manage the memory
      of multiple physical devices under the same struct device umbrella.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-13-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
      Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      858b54da
    • mm/hmm/devmem: device memory hotplug using ZONE_DEVICE · 4ef589dc
      Jérôme Glisse authored
      This introduces a simple struct and associated helpers for device
      drivers to use when hotplugging un-addressable device memory as
      ZONE_DEVICE.  It will find an unused physical address range and trigger
      memory hotplug for it, which allocates and initializes struct page for
      the device memory.
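
      To make "find an unused physical address range" concrete, here is a
      minimal userspace sketch.  It is an assumption about the general shape
      of such a search (scan size-aligned windows downward from a limit and
      take the first one that overlaps nothing), not the helper added by this
      patch; toy_range and toy_find_free_range are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct toy_range {
	uint64_t start, end;	/* inclusive bounds of a busy region */
};

/* True if [start, end] intersects any busy range. */
bool toy_overlaps(const struct toy_range *busy, size_t n,
		  uint64_t start, uint64_t end)
{
	for (size_t i = 0; i < n; i++)
		if (start <= busy[i].end && busy[i].start <= end)
			return true;
	return false;
}

/*
 * Walk downward from `limit` in steps of `size` and return the start of the
 * first size-aligned window that overlaps no busy range, or UINT64_MAX if
 * there is none.
 */
uint64_t toy_find_free_range(const struct toy_range *busy, size_t n,
			     uint64_t limit, uint64_t size)
{
	if (size == 0 || limit < size - 1)
		return UINT64_MAX;

	/* Highest aligned start whose window still ends at or below limit. */
	uint64_t start = ((limit - (size - 1)) / size) * size;

	for (;;) {
		if (!toy_overlaps(busy, n, start, start + size - 1))
			return start;
		if (start < size)
			break;
		start -= size;
	}
	return UINT64_MAX;
}
```

      Searching from the top of the address space is a design choice of the
      sketch: the range does not need to point at real memory, so it can be
      placed where it is least likely to collide with anything else.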
      
      Device drivers should use this helper during device initialization to
      hotplug the device memory.  They should only need to remove the memory
      once the device is going offline (shutdown or hotremove).  There should
      not be any userspace API to hotplug memory, except maybe for a host
      device driver to allow adding more memory to a guest device driver.
      
      Device memory is managed by the device driver and HMM only provides
      helpers to that effect.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-12-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
      Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
      Signed-off-by: Balbir Singh <bsingharora@gmail.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4ef589dc
    • mm/memcontrol: support MEMORY_DEVICE_PRIVATE · c733a828
      Jérôme Glisse authored
      HMM pages (private or public device pages) are ZONE_DEVICE pages and
      thus need special handling when it comes to lru or refcount.  This patch
      makes sure that memcontrol properly handles them when it encounters
      them.  Those pages are used like regular pages in a process address
      space, either as anonymous pages or as file-backed pages.  So from the
      memcg point of view we want to handle them like regular pages, for now
      at least.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-11-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c733a828
    • mm/memcontrol: allow to uncharge page without using page->lru field · a9d5adee
      Jérôme Glisse authored
      HMM pages (private or public device pages) are ZONE_DEVICE pages and
      thus you cannot use the page->lru fields of those pages.  This patch
      rearranges the uncharge path to allow a single page to be uncharged
      without modifying the lru field of its struct page.
      
      There is no change to memcontrol logic, it is the same as it was
      before this patch.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-10-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9d5adee
    • mm/ZONE_DEVICE: special case put_page() for device private pages · 7b2d55d2
      Jérôme Glisse authored
      A ZONE_DEVICE page that reaches a refcount of 1 is free, i.e. it no
      longer has any user.  For device private pages this is important to
      catch, and thus we need to special-case put_page() for them.
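
      The refcount rule can be shown with a small userspace model.  This is
      only a sketch under stated assumptions: toy_page, toy_put_page and the
      callback field are hypothetical names, and a plain int replaces the
      kernel's atomic refcount.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a struct page that may be device private. */
struct toy_page {
	int refcount;
	bool device_private;
	void (*page_free)(struct toy_page *page);	/* driver callback */
	int free_calls;		/* bookkeeping for the demo only */
};

/* Example driver callback: just records that the page came back. */
void toy_driver_page_free(struct toy_page *page)
{
	page->free_calls++;
}

/* Drop one reference, treating refcount == 1 as "free" for device pages. */
void toy_put_page(struct toy_page *page)
{
	assert(page->refcount > 0);
	int count = --page->refcount;

	if (page->device_private) {
		if (count == 1 && page->page_free)
			page->page_free(page);
		return;
	}
	if (count == 0)
		page->free_calls++;	/* regular page: freed at zero */
}
```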
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-9-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7b2d55d2
    • mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory · 5042db43
      Jérôme Glisse authored
      HMM (heterogeneous memory management) needs struct page to support
      migration from system main memory to device memory.  The reasons for
      HMM and migration to device memory are explained in the HMM core patch.
      
      This patch deals with device memory that is un-addressable (i.e. the
      CPU cannot access it).  Hence we do not want those struct pages to be
      managed like regular memory.  That is why we extend ZONE_DEVICE to
      support different types of memory.
      
      A persistent memory type is defined for the existing users of
      ZONE_DEVICE and a new device un-addressable type is added for the
      un-addressable memory.  There is a clear separation between what is
      expected from each memory type, and existing users of ZONE_DEVICE are
      unaffected by the new requirements and the new use of the
      un-addressable type.  All type-specific code paths are protected with
      a test against the memory type.
      
      Because the memory is un-addressable, we use a new special swap type
      for when a page is migrated to device memory (this reduces the maximum
      number of swap files).
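
      Why a special swap type costs swap files can be seen in a toy
      encoding.  The sketch below is an assumption for illustration, not the
      kernel's layout: it packs a 5-bit type and a 59-bit offset into one
      64-bit value and reserves the two highest types for device memory, so
      only 30 of the 32 types remain for real swap files.  All TOY_ and toy_
      names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_TYPE_BITS     5u			/* 32 possible swap types */
#define TOY_TYPE_COUNT    (1u << TOY_TYPE_BITS)
#define TOY_OFFSET_BITS   (64u - TOY_TYPE_BITS)
#define TOY_OFFSET_MASK   ((UINT64_C(1) << TOY_OFFSET_BITS) - 1)

/* Two types carved out of the top of the range for device memory. */
#define TOY_DEVICE_NUM    2u
#define TOY_DEVICE_WRITE  (TOY_TYPE_COUNT - 1)
#define TOY_DEVICE_READ   (TOY_TYPE_COUNT - 2)
#define TOY_MAX_SWAPFILES (TOY_TYPE_COUNT - TOY_DEVICE_NUM)

typedef struct {
	uint64_t val;
} toy_swp_entry;

/* Pack a type and an offset (e.g. a device page number) into one entry. */
toy_swp_entry toy_swp_make(unsigned type, uint64_t offset)
{
	assert(type < TOY_TYPE_COUNT);
	toy_swp_entry e = {
		((uint64_t)type << TOY_OFFSET_BITS) | (offset & TOY_OFFSET_MASK)
	};
	return e;
}

unsigned toy_swp_type(toy_swp_entry e)
{
	return (unsigned)(e.val >> TOY_OFFSET_BITS);
}

uint64_t toy_swp_offset(toy_swp_entry e)
{
	return e.val & TOY_OFFSET_MASK;
}

/* Entries using a reserved type refer to device memory, not to a swap file. */
bool toy_is_device_entry(toy_swp_entry e)
{
	return toy_swp_type(e) >= TOY_MAX_SWAPFILES;
}
```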
      
      The two main additions to ZONE_DEVICE, besides the memory type, are two
      callbacks.  The first one, page_free(), is called whenever the page
      refcount reaches 1 (which means the page is free, as a ZONE_DEVICE page
      never reaches a refcount of 0).  This allows the device driver to
      manage its memory and the associated struct page.
      
      The second callback, page_fault(), is called when there is a CPU access
      to an address that is backed by a device page (which is un-addressable
      by the CPU).  This callback is responsible for migrating the page back
      to system main memory.  The device driver cannot block migration back
      to system memory; HMM makes sure that such pages cannot be pinned into
      device memory.
      
      If the device is in some error condition and cannot migrate memory
      back, then a CPU page fault to device memory should end with SIGBUS.
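
      The fault path described above, including the SIGBUS outcome, can be
      sketched in userspace as follows.  The callback type and all toy_
      names are hypothetical; the model only captures the control flow
      (call the driver, and report SIGBUS if the page is still in device
      memory afterwards).

```c
#include <assert.h>
#include <stdbool.h>

enum toy_fault_result { TOY_FAULT_DONE, TOY_FAULT_SIGBUS };

struct toy_page {
	bool in_device_memory;
	int payload;
};

/* Hypothetical driver callback: returns 0 once the page is back in RAM. */
typedef int (*toy_page_fault_fn)(struct toy_page *page);

/* A healthy device migrates the page back to system memory. */
int toy_migrate_back(struct toy_page *page)
{
	page->in_device_memory = false;
	return 0;
}

/* A device in an error state cannot migrate anything. */
int toy_device_broken(struct toy_page *page)
{
	(void)page;
	return -1;
}

/* CPU touched `page`: resolve the fault or report SIGBUS. */
enum toy_fault_result toy_cpu_fault(struct toy_page *page,
				    toy_page_fault_fn page_fault)
{
	if (!page->in_device_memory)
		return TOY_FAULT_DONE;	/* ordinary page, nothing to do */
	if (page_fault(page) != 0 || page->in_device_memory)
		return TOY_FAULT_SIGBUS;
	return TOY_FAULT_DONE;
}
```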
      
      [arnd@arndb.de: fix warning]
        Link: http://lkml.kernel.org/r/20170823133213.712917-1-arnd@arndb.de
      Link: http://lkml.kernel.org/r/20170817000548.32038-8-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5042db43
    • mm/memory_hotplug: introduce add_pages · 3072e413
      Michal Hocko authored
      There are new users of memory hotplug emerging.  Some of them require a
      different subset of arch_add_memory.  Some only require allocation of
      struct pages without mapping those pages into the kernel address space.
      We currently have __add_pages for that purpose, but it is rather low
      level and not very suitable for code outside of memory hotplug.  E.g.
      x86_64 wants to update max_pfn, which should be done by the caller.
      Introduce add_pages(), which takes care of those details if they are
      needed.  Each architecture that needs them should define its own
      implementation and select CONFIG_ARCH_HAS_ADD_PAGES.  All others use
      the currently existing __add_pages.
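
      The selection between an arch implementation and the generic fallback
      can be sketched in userspace like this.  The toy_ names, the TOY_
      macro and the simplified two-argument signatures are assumptions made
      for the example; they only mirror the idea that the arch flavour does
      extra bookkeeping (here, a stand-in for max_pfn) on top of the
      low-level helper.

```c
#include <assert.h>

#define TOY_ARCH_HAS_ADD_PAGES 1	/* remove to get the generic fallback */

unsigned long toy_max_pfn;	/* stand-in for the arch's end-of-memory marker */
int toy_lowlevel_calls;		/* counts calls, for the demo only */

/* Stand-in for __add_pages: pretend to allocate struct pages for the range. */
int toy_lowlevel_add_pages(unsigned long start_pfn, unsigned long nr_pages)
{
	(void)start_pfn;
	assert(nr_pages > 0);
	toy_lowlevel_calls++;
	return 0;
}

#ifdef TOY_ARCH_HAS_ADD_PAGES
/* Arch flavour: do the low-level work, then the arch-specific bookkeeping. */
int toy_add_pages(unsigned long start_pfn, unsigned long nr_pages)
{
	int ret = toy_lowlevel_add_pages(start_pfn, nr_pages);

	if (ret == 0 && start_pfn + nr_pages > toy_max_pfn)
		toy_max_pfn = start_pfn + nr_pages;
	return ret;
}
#else
/* Generic flavour: a plain wrapper around the low-level helper. */
int toy_add_pages(unsigned long start_pfn, unsigned long nr_pages)
{
	return toy_lowlevel_add_pages(start_pfn, nr_pages);
}
#endif
```

      Callers outside memory hotplug then only ever call toy_add_pages() and
      never need to know which flavour was built.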
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-7-jglisse@redhat.com
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3072e413
    • mm/hmm/mirror: device page fault handler · 74eee180
      Jérôme Glisse authored
      This handles page faults on behalf of a device driver.  Unlike
      handle_mm_fault(), it does not trigger migration back to system memory
      for device memory.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-6-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
      Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      74eee180
    • mm/hmm/mirror: helper to snapshot CPU page table · da4c3c73
      Jérôme Glisse authored
      This does not use the existing page table walker because we want to
      share the same code with our page fault handler.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-5-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
      Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      da4c3c73
    • mm/hmm/mirror: mirror process address space on device with HMM helpers · c0b12405
      Jérôme Glisse authored
      This is heterogeneous memory management (HMM) process address space
      mirroring.  In a nutshell, this provides an API to mirror a process
      address space on a device.  This boils down to keeping the CPU and
      device page tables synchronized (we assume that both device and CPU
      are cache coherent, like PCIe devices can be).
      
      This patch provides a simple API for device drivers to achieve address
      space mirroring, thus sparing each device driver from growing its own
      CPU page table walker and its own CPU page table synchronization
      mechanism.
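
      What "keeping the page tables synchronized" means can be shown with a
      deliberately tiny userspace model.  It is a sketch under assumptions,
      not the HMM API: two flat arrays play the CPU and device page tables,
      a notifier-style hook invalidates device entries before the CPU entry
      changes, and a device fault refills from the CPU table.  All names are
      hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NPAGES 8
#define TOY_NONE   (-1L)	/* "no mapping" marker */

struct toy_mirror {
	long cpu_pt[TOY_NPAGES];	/* virtual page index -> pfn */
	long dev_pt[TOY_NPAGES];	/* device copy: may be empty, never stale */
};

void toy_mirror_init(struct toy_mirror *m)
{
	for (size_t i = 0; i < TOY_NPAGES; i++)
		m->cpu_pt[i] = m->dev_pt[i] = TOY_NONE;
}

/* Notifier-style hook: drop device entries in [start, end). */
void toy_mirror_invalidate(struct toy_mirror *m, size_t start, size_t end)
{
	assert(start <= end && end <= TOY_NPAGES);
	for (size_t i = start; i < end; i++)
		m->dev_pt[i] = TOY_NONE;
}

/* CPU-side update: invalidate the mirror first, then change the CPU entry. */
void toy_cpu_set(struct toy_mirror *m, size_t idx, long pfn)
{
	toy_mirror_invalidate(m, idx, idx + 1);
	m->cpu_pt[idx] = pfn;
}

/* Device fault: refill the device entry from the current CPU entry. */
long toy_dev_fault(struct toy_mirror *m, size_t idx)
{
	assert(idx < TOY_NPAGES);
	m->dev_pt[idx] = m->cpu_pt[idx];
	return m->dev_pt[idx];
}
```

      Invalidating before the CPU entry changes is what keeps the device
      from ever using a translation the CPU has already replaced.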
      
      This is useful for NVidia GPU >= Pascal, Mellanox IB >= mlx5 and more
      hardware in the future.
      
      [jglisse@redhat.com: fix hmm for "mmu_notifier kill invalidate_page callback"]
        Link: http://lkml.kernel.org/r/20170830231955.GD9445@redhat.com
      Link: http://lkml.kernel.org/r/20170817000548.32038-4-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
      Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c0b12405
    • mm/hmm: heterogeneous memory management (HMM for short) · 133ff0ea
      Jérôme Glisse authored
      HMM provides 3 separate types of functionality:
          - Mirroring: synchronize CPU page table and device page table
          - Device memory: allocating struct page for device memory
          - Migration: migrating regular memory to device memory
      
      This patch introduces some common helpers and definitions shared by
      all 3 of those functionalities.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-3-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
      Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      133ff0ea
    • hmm: heterogeneous memory management documentation · bffc33ec
      Jérôme Glisse authored
      Patch series "HMM (Heterogeneous Memory Management)", v25.
      
      Heterogeneous Memory Management (HMM) (description and justification)
      
      Today device drivers expose dedicated memory allocation APIs through
      their device file, often relying on a combination of IOCTL and mmap
      calls.  The device can only access and use memory allocated through
      this API.  This effectively splits the program address space into
      objects allocated for the device and usable by the device, and other
      regular memory (malloc, mmap of a file, shared memory, ...) only
      accessible by the CPU (or in a very limited way by a device by pinning
      memory).
      
      Allowing different isolated components of a program to use a device
      thus requires duplication of the input data structures using the device
      memory allocator.  This is reasonable for simple data structures (array,
      grid, image, ...) but this gets extremely complex with advanced data
      structures (list, tree, graph, ...) that rely on a web of memory
      pointers.  This is becoming a serious limitation on the kind of
      workload that can be offloaded to devices like GPUs.
      
      New industry standards like C++, OpenCL or CUDA are pushing to remove
      this barrier.  This requires a shared address space between the GPU
      device and the CPU so that the GPU can access any memory of a process
      (while still obeying memory protection like read only).  This kind of
      feature is also appearing in various other operating systems.
      
      HMM is a set of helpers to facilitate several aspects of address space
      sharing and device memory management.  Unlike existing sharing
      mechanisms that rely on pinning pages used by a device, HMM relies on
      mmu_notifier to propagate CPU page table updates to the device page
      table.
      
      Duplicating the CPU page table is only one aspect necessary for
      efficiently using devices like GPUs.  GPU local memory has bandwidth in
      the TeraBytes/second range but it is connected to main memory through a
      system bus like PCIE that is limited to 32 GigaBytes/second (PCIE 4.0
      16x).  Thus it is necessary to allow migration of process memory from
      main system memory to device memory.  The issue is that on platforms
      that only have PCIE, the device memory is not accessible by the CPU
      with the same properties as main memory (cache coherency, atomic
      operations, ...).
      
      To allow migration from main memory to device memory, HMM provides a
      set of helpers to hotplug device memory as a new type of ZONE_DEVICE
      memory which is un-addressable by the CPU but still has struct page
      representing it.  This allows most of the core kernel logic that deals
      with process memory to stay oblivious of the peculiarities of device
      memory.
      
      When a page backing an address of a process is migrated to device
      memory, the CPU page table entry is set to a new specific swap entry.
      CPU access to such an address triggers a migration back to system
      memory, just as if the page had been swapped to disk.  HMM also blocks
      anyone from pinning a ZONE_DEVICE page so that it can always be
      migrated back to system memory if the CPU accesses it.  Conversely HMM
      does not migrate to device memory any page that is pinned in system
      memory.
      
      To allow efficient migration between device memory and main memory a
      new migrate_vma() helper is added with this patchset.  It allows
      leveraging the device DMA engine to perform the copy operation.
      
      This feature will be used by upstream drivers like nouveau and mlx5,
      and probably others in the future (amdgpu is next suspect in line).  We
      are actively working on nouveau and mlx5 support.  To test this
      patchset we also worked with the NVidia closed source driver team; they
      have more resources than us to test this kind of infrastructure and
      also a bigger and better userspace ecosystem with various real industry
      workloads they can use to test and profile HMM.
      
      The expected workload is a program that builds a data set on the CPU
      (from disk, from network, from sensors, ...).  The program uses a GPU
      API (OpenCL, CUDA, ...) to give hints on memory placement for the input
      data and also for the output buffer.  The program calls the GPU API to
      schedule a GPU job; this happens using a device driver specific ioctl.
      All this is hidden from the programmer's point of view in the case of a
      C++ compiler that transparently offloads some part of a program to the
      GPU.  The program can keep doing other stuff on the CPU while the GPU
      is crunching numbers.
      
      It is expected that the CPU will not access the same data set as the
      GPU while the GPU is working on it, but this is not mandatory.  In fact
      we expect some small memory objects to be actively accessed by both GPU
      and CPU concurrently as a synchronization channel and/or for monitoring
      purposes.  Such objects will stay in system memory and should not be
      bottlenecked by system bus bandwidth (rare write and read accesses from
      both CPU and GPU).
      
      As we are relying on the device driver API, HMM does not introduce any
      new syscall nor does it modify any existing ones.  It does not change
      any POSIX semantics or behaviors.  For instance the child after a fork
      of a process that is using HMM will not be impacted in any way, nor is
      there any data hazard between child COW or parent COW of memory that
      was migrated to device prior to fork.
      
      HMM assumes a number of hardware features.  The device must allow its
      page table to be updated at any time (i.e. device jobs must be
      preemptable).  The device page table must provide memory protection
      such as read only.  The device must track write access (dirty bit).
      The device must have a minimum granularity that matches PAGE_SIZE
      (i.e. 4k).
      
      Reviewer (just hint):
      Patch 1  HMM documentation
      Patch 2  introduce core infrastructure and definition of HMM, pretty
               small patch and easy to review
      Patch 3  introduce the mirror functionality of HMM, it relies on
               mmu_notifier and thus someone familiar with that part would be
               in better position to review
      Patch 4  is an helper to snapshot CPU page table while synchronizing with
               concurrent page table update. Understanding mmu_notifier makes
               review easier.
      Patch 5  is mostly a wrapper around handle_mm_fault()
      Patch 6  add new add_pages() helper to avoid modifying each arch memory
               hot plug function
      Patch 7  add a new memory type for ZONE_DEVICE and also add all the logic
               in various core mm to support this new type. Dan Williams and
               any core mm contributor are best people to review each half of
               this patchset
      Patch 8  special case HMM ZONE_DEVICE pages inside put_page() Kirill and
               Dan Williams are best person to review this
      Patch 9  allow to uncharge a page from memory group without using the lru
               list field of struct page (best reviewer: Johannes Weiner or
               Vladimir Davydov or Michal Hocko)
      Patch 10 Add support to uncharge ZONE_DEVICE page from a memory cgroup (best
               reviewer: Johannes Weiner or Vladimir Davydov or Michal Hocko)
      Patch 11 add helper to hotplug un-addressable device memory as new type
               of ZONE_DEVICE memory (new type introduced in patch 7 of this
               series). This is boilerplate code around memory hotplug and it
               also picks a free range of physical address for the device
               memory. Note that the physical address does not point to
               anything (at least as far as the kernel knows).
      Patch 12 introduce a new hmm_device class as a helper for device driver
               that want to expose multiple device memory under a common fake
               device driver. This is useful for multi-gpu configuration.
               Anyone familiar with device driver infrastructure can review
               this. Boilerplate code really.
      Patch 13 add a new migrate mode. Any one familiar with page migration is
               welcome to review.
      Patch 14 introduce a new migration helper (migrate_vma()) that allow to
               migrate a range of virtual address of a process using device DMA
               engine to perform the copy. It is not limited to do copy from and
               to device but can also do copy between any kind of source and
               destination memory. Again anyone familiar with migration code
               should be able to verify the logic.
      Patch 15 optimize the new migrate_vma() by unmapping pages while we are
               collecting them. This can be review by any mm folks.
      Patch 16 add unaddressable memory migration to helper introduced in patch
               14, this can be reviewed by anyone familiar with migration code
      Patch 17 add a feature that allow device to allocate non-present page on
               the GPU when migrating a range of address to device memory. This
               is an helper for device driver to avoid having to first allocate
               system memory before migration to device memory
      Patch 18 add a new kind of ZONE_DEVICE memory for cache coherent device
               memory (CDM)
      Patch 19 add an helper to hotplug CDM memory
      
      Previous patchset posting :
      v1 http://lwn.net/Articles/597289/
      v2 https://lkml.org/lkml/2014/6/12/559
      v3 https://lkml.org/lkml/2014/6/13/633
      v4 https://lkml.org/lkml/2014/8/29/423
      v5 https://lkml.org/lkml/2014/11/3/759
      v6 http://lwn.net/Articles/619737/
      v7 http://lwn.net/Articles/627316/
      v8 https://lwn.net/Articles/645515/
      v9 https://lwn.net/Articles/651553/
      v10 https://lwn.net/Articles/654430/
      v11 http://www.gossamer-threads.com/lists/linux/kernel/2286424
      v12 http://www.kernelhub.org/?msg=972982&p=2
      v13 https://lwn.net/Articles/706856/
      v14 https://lkml.org/lkml/2016/12/8/344
      v15 http://www.mail-archive.com/linux-kernel@xxxxxxxxxxxxxxx/msg1304107.html
      v16 http://www.spinics.net/lists/linux-mm/msg119814.html
      v17 https://lkml.org/lkml/2017/1/27/847
      v18 https://lkml.org/lkml/2017/3/16/596
      v19 https://lkml.org/lkml/2017/4/5/831
      v20 https://lwn.net/Articles/720715/
      v21 https://lkml.org/lkml/2017/4/24/747
      v22 http://lkml.iu.edu/hypermail/linux/kernel/1705.2/05176.html
      v23 https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1404788.html
      v24 https://lwn.net/Articles/726691/
      
      This patch (of 19):
      
      This adds documentation for HMM (Heterogeneous Memory Management).  It
      presents the motivation behind it, the features necessary for it to be
      useful, and gives an overview of how it is implemented.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-2-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bffc33ec
    • Naoya Horiguchi's avatar
      mm: memory_hotplug: memory hotremove supports thp migration · 8135d892
      Naoya Horiguchi authored
      This patch enables thp migration for memory hotremove.
      
      Link: http://lkml.kernel.org/r/20170717193955.20207-11-zi.yan@sent.com
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8135d892
    • Naoya Horiguchi's avatar
      mm: migrate: move_pages() supports thp migration · e8db67eb
      Naoya Horiguchi authored
      This patch enables thp migration for move_pages(2).
      
      Link: http://lkml.kernel.org/r/20170717193955.20207-10-zi.yan@sent.com
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e8db67eb
    • Naoya Horiguchi's avatar
      mm: mempolicy: mbind and migrate_pages support thp migration · c8633798
      Naoya Horiguchi authored
      This patch enables thp migration for mbind(2) and migrate_pages(2).
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c8633798
    • Naoya Horiguchi's avatar
      mm: soft-dirty: keep soft-dirty bits over thp migration · ab6e3d09
      Naoya Horiguchi authored
      The soft-dirty bit is designed to be preserved across page migration.
      This patch makes it work in the same manner for thp migration too.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ab6e3d09
    • Zi Yan's avatar
      mm: thp: check pmd migration entry in common path · 84c3fc4e
      Zi Yan authored
      When THP migration is being used, memory management code needs to handle
      pmd migration entries properly.  This patch uses !pmd_present() or
      is_swap_pmd() (depending on whether pmd_none() needs separate code or
      not) to check pmd migration entries at the places where a pmd entry is
      present.
      
      Since pmd-related code uses split_huge_page(), split_huge_pmd(),
      pmd_trans_huge(), pmd_trans_unstable(), or
      pmd_none_or_trans_huge_or_clear_bad(), this patch:
      
      1. adds pmd migration entry split code in split_huge_pmd(),
      
      2. takes care of pmd migration entries whenever pmd_trans_huge() is present,
      
      3. makes pmd_none_or_trans_huge_or_clear_bad() pmd migration entry aware.
      
      Since split_huge_page() uses split_huge_pmd() and pmd_trans_unstable()
      is equivalent to pmd_none_or_trans_huge_or_clear_bad(), we do not change
      them.
      
      Until this commit, a pmd entry should be:
      1. pointing to a pte page,
      2. is_swap_pmd(),
      3. pmd_trans_huge(),
      4. pmd_devmap(), or
      5. pmd_none().
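
      The difference between the two checks named above can be sketched in
      plain C.  This is a toy model, not kernel code: toy_pmd_t, the
      TOY_PMD_PRESENT bit and the toy_* helpers are hypothetical stand-ins,
      and is_swap_pmd() is assumed here to mean "neither none nor present".
      A bare !pmd_present() test is also true for an empty entry, so it only
      fits call sites that already handle pmd_none() separately.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy PMD value: NOT the kernel's pmd_t. The bit layout is illustrative only. */
typedef uint64_t toy_pmd_t;

#define TOY_PMD_PRESENT (1ULL << 0) /* hypothetical "present" bit */

/* An all-zero entry is treated as empty. */
static bool toy_pmd_none(toy_pmd_t pmd)
{
    return pmd == 0;
}

/* Present means the toy present bit is set. */
static bool toy_pmd_present(toy_pmd_t pmd)
{
    return (pmd & TOY_PMD_PRESENT) != 0;
}

/* Non-empty and non-present: a swap-style entry such as a migration entry. */
static bool toy_is_swap_pmd(toy_pmd_t pmd)
{
    return !toy_pmd_none(pmd) && !toy_pmd_present(pmd);
}
```

      For an empty entry, !toy_pmd_present() is true while toy_is_swap_pmd()
      is false, which is the case the "depending on whether pmd_none() needs
      separate code" remark is about.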
      Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      84c3fc4e
    • Zi Yan's avatar
      mm: thp: enable thp migration in generic path · 616b8371
      Zi Yan authored
      Add thp migration's core code, including conversions between a PMD entry
      and a swap entry, setting PMD migration entry, removing PMD migration
      entry, and waiting on PMD migration entries.
      
      This patch makes it possible to support thp migration.  If you fail to
      allocate a destination page as a thp, you just split the source thp as
      we do now, and then enter the normal page migration.  If you succeed in
      allocating a destination thp, you enter thp migration.  Subsequent patches
      actually enable thp migration for each caller of page migration by
      allowing its get_new_page() callback to allocate thps.
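
      The decision described above can be sketched as a small pure function.
      This is only an illustration of the control flow: the enum, its values
      and choose_migrate_path() are hypothetical names, not functions from
      the patch.

```c
#include <stdbool.h>

/* Hypothetical labels for the paths described in the commit message. */
enum migrate_path {
    PATH_BASE_PAGE,       /* source is not a thp: normal page migration */
    PATH_THP_MIGRATION,   /* destination thp allocated: migrate the thp whole */
    PATH_SPLIT_THEN_BASE  /* no destination thp: split source, then normal migration */
};

/* Pick a path from whether the source is a thp and whether a destination
 * thp could be allocated. */
static enum migrate_path choose_migrate_path(bool src_is_thp,
                                             bool dest_thp_allocated)
{
    if (!src_is_thp)
        return PATH_BASE_PAGE;
    if (dest_thp_allocated)
        return PATH_THP_MIGRATION;
    return PATH_SPLIT_THEN_BASE;
}
```

      The fallback branch keeps the pre-existing behaviour, so a failed thp
      allocation costs nothing beyond what migration already did before.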
      
      [zi.yan@cs.rutgers.edu: fix gcc-4.9.0 -Wmissing-braces warning]
        Link: http://lkml.kernel.org/r/A0ABA698-7486-46C3-B209-E95A9048B22C@cs.rutgers.edu
      [akpm@linux-foundation.org: fix x86_64 allnoconfig warning]
      Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      616b8371
    • Naoya Horiguchi's avatar
      mm: thp: introduce CONFIG_ARCH_ENABLE_THP_MIGRATION · 9c670ea3
      Naoya Horiguchi authored
      Introduce CONFIG_ARCH_ENABLE_THP_MIGRATION to limit thp migration
      functionality to x86_64, which should be safer as a first step.
      
      Link: http://lkml.kernel.org/r/20170717193955.20207-5-zi.yan@sent.com
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
      Reviewed-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9c670ea3
    • Naoya Horiguchi's avatar
      mm: thp: introduce separate TTU flag for thp freezing · b5ff8161
      Naoya Horiguchi authored
      TTU_MIGRATION is used to convert pte into migration entry until thp
      split completes.  This behavior conflicts with thp migration added in
      later patches, so let's introduce a new TTU flag specifically for freezing.
      
      try_to_unmap() is used both for thp split (via freeze_page()) and page
      migration (via __unmap_and_move()).  In freeze_page(), ttu_flag given
      for head page is like below (assuming anonymous thp):
      
          (TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED | \
           TTU_MIGRATION | TTU_SPLIT_HUGE_PMD)
      
      and ttu_flag given for tail pages is:
      
          (TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED | \
           TTU_MIGRATION)
      
      __unmap_and_move() calls try_to_unmap() with ttu_flag:
      
          (TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS)
      
      Now I'm trying to insert a branch for thp migration at the top of
      try_to_unmap_one() like below
      
          static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
                                      unsigned long address, void *arg)
          {
                  ...
                  /* PMD-mapped THP migration entry */
                  if (!pvmw.pte && (flags & TTU_MIGRATION)) {
                          if (!PageAnon(page))
                                  continue;

                          set_pmd_migration_entry(&pvmw, page);
                          continue;
                  }
                  ...
          }
      
      so try_to_unmap() for tail pages called by thp split can go into the thp
      migration code path (which converts *pmd* into a migration entry), while
      the expectation is to freeze the thp (which converts *pte* into a
      migration entry).
      
      I detected this failure as a "bad page state" error in a testcase where
      split_huge_page() is called from queue_pages_pte_range().
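
      The flag conflict described above can be reproduced with plain
      bitmasks.  This is a sketch under stated assumptions: the SK_TTU_*
      values are illustrative rather than the kernel's, SK_TTU_FREEZE is a
      hypothetical name for the new freezing flag, and the branch condition
      is reduced to "PMD-mapped and TTU_MIGRATION set".

```c
#include <stdbool.h>

/* Illustrative flag values; the kernel's actual values may differ. */
#define SK_TTU_MIGRATION      0x01u
#define SK_TTU_IGNORE_MLOCK   0x02u
#define SK_TTU_IGNORE_ACCESS  0x04u
#define SK_TTU_RMAP_LOCKED    0x08u
#define SK_TTU_SPLIT_HUGE_PMD 0x10u
#define SK_TTU_FREEZE         0x20u /* hypothetical name for the new flag */

/* Flags passed for tail pages by the thp split path, as listed in the text. */
static unsigned int tail_flags_shared_flag(void)
{
    return SK_TTU_IGNORE_MLOCK | SK_TTU_IGNORE_ACCESS |
           SK_TTU_RMAP_LOCKED | SK_TTU_MIGRATION;
}

/* Same call once freezing has its own flag instead of reusing TTU_MIGRATION. */
static unsigned int tail_flags_separate_flag(void)
{
    return SK_TTU_IGNORE_MLOCK | SK_TTU_IGNORE_ACCESS |
           SK_TTU_RMAP_LOCKED | SK_TTU_FREEZE;
}

/* Simplified form of the new branch: no pte (PMD-mapped) and TTU_MIGRATION. */
static bool takes_pmd_migration_branch(bool pmd_mapped, unsigned int flags)
{
    return pmd_mapped && (flags & SK_TTU_MIGRATION) != 0;
}
```

      With the shared flag, the split path is indistinguishable from a real
      migration request; with a separate flag, only __unmap_and_move()-style
      callers reach the PMD migration branch.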
      
      Link: http://lkml.kernel.org/r/20170717193955.20207-4-zi.yan@sent.com
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b5ff8161
    • Naoya Horiguchi's avatar
      mm: x86: move _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 1 · eee4818b
      Naoya Horiguchi authored
      _PAGE_PSE is used to distinguish between a truly non-present
      (_PAGE_PRESENT=0) PMD, and a PMD which is undergoing a THP split and
      should be treated as present.
      
      But _PAGE_SWP_SOFT_DIRTY currently uses the _PAGE_PSE bit, which would
      cause confusion between one of those PMDs undergoing a THP split, and a
      soft-dirty PMD.  Dropping _PAGE_PSE check in pmd_present() does not work
      well, because it can hurt optimization of tlb handling in thp split.
      
      Thus, we need to move the bit.
      
      In the current kernel, bits 1-4 are not used in non-present format since
      commit 00839ee3 ("x86/mm: Move swap offset/type up in PTE to work
      around erratum").  So let's move _PAGE_SWP_SOFT_DIRTY to bit 1.  Bit 7
      is used as reserved (always clear), so please don't use it for other
      purposes.
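
      The bit collision can be sketched with a few masks.  The bit positions
      for present (0), PSE (7) and the old and new soft-dirty locations (7
      and 1) follow the text above; the SK_* names and sk_pmd_present() are
      hypothetical, and the presence test is simplified to just the two bits
      discussed here (the real pmd_present() may test more).

```c
#include <stdbool.h>
#include <stdint.h>

#define SK_PAGE_PRESENT       (1ULL << 0)
#define SK_PAGE_PSE           (1ULL << 7)
#define SK_SWP_SOFT_DIRTY_OLD (1ULL << 7) /* shares the PSE bit */
#define SK_SWP_SOFT_DIRTY_NEW (1ULL << 1) /* unused in the non-present format */

/* Simplified presence test: a PMD with PSE set is treated as present even
 * when the present bit is clear (the "undergoing a THP split" case). */
static bool sk_pmd_present(uint64_t pmd)
{
    return (pmd & (SK_PAGE_PRESENT | SK_PAGE_PSE)) != 0;
}
```

      A non-present swap-style value tagged with SK_SWP_SOFT_DIRTY_OLD
      passes sk_pmd_present(), while the same value tagged with
      SK_SWP_SOFT_DIRTY_NEW does not, which is the confusion the move
      avoids.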
      
      Link: http://lkml.kernel.org/r/20170717193955.20207-3-zi.yan@sent.com
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eee4818b
    • Naoya Horiguchi's avatar
      mm: mempolicy: add queue_pages_required() · 88aaa2a1
      Naoya Horiguchi authored
      Patch series "mm: page migration enhancement for thp", v9.
      
      Motivations:
      
      1. THP migration becomes important in the upcoming heterogeneous memory
         systems. As David Nellans from NVIDIA pointed out in another thread
         (http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1349227.html),
         future GPUs or other accelerators will have their memory managed by
         operating systems. Moving data into and out of these memory nodes
         efficiently is critical to applications that use GPUs or other
         accelerators. Existing page migration only supports base pages, which
         has a very low memory bandwidth utilization. My experiments (see
         below) show THP migration can migrate pages more efficiently.
      
      2. Base page migration vs THP migration throughput.
      
         Here are cross-socket page migration results from calling
         move_pages() syscall:
      
         In x86_64, an Intel two-socket E5-2640v3 box,
          - single 4KB base page migration takes 62.47 us, using 0.06 GB/s BW,
          - single 2MB THP migration takes 658.54 us, using 2.97 GB/s BW,
          - 512 4KB base page migration takes 1987.38 us, using 0.98 GB/s BW.
      
         In ppc64, a two-socket Power8 box,
          - single 64KB base page migration takes 49.3 us, using 1.24 GB/s BW,
          - single 16MB THP migration takes 2202.17 us, using 7.10 GB/s BW,
          - 256 64KB base page migration takes 2543.65 us, using 6.14 GB/s BW.
      
         THP migration can give us 3x and 1.15x throughput over base page
         migration in x86_64 and ppc64 respectively.
      
         You can test it out by using the code here:
            https://github.com/x-y-z/thp-migration-bench
      
      3. Existing page migration splits THP before migration and cannot
         guarantee the migrated pages are still contiguous. Contiguity is
         always what GPUs and accelerators look for. Without THP migration,
         khugepaged needs to do extra work to reassemble the migrated pages
         back to THPs.
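
      The throughput figures in item 2 can be rechecked from the quoted
      sizes and latencies.  A minimal sketch, assuming the "GB/s" values are
      binary gigabytes (2^30 bytes) per second, which is the unit that
      reproduces them; bandwidth_gib_per_s() is a hypothetical helper name.

```c
/* Convert a transfer of `bytes` taking `microseconds` into GiB per second. */
static double bandwidth_gib_per_s(double bytes, double microseconds)
{
    const double gib = 1024.0 * 1024.0 * 1024.0;
    return (bytes / gib) / (microseconds / 1e6);
}

/* Ratio of THP throughput to base-page throughput for the same total size. */
static double thp_speedup(double bytes, double thp_us, double base_us)
{
    return bandwidth_gib_per_s(bytes, thp_us) /
           bandwidth_gib_per_s(bytes, base_us);
}
```

      For the x86_64 case, bandwidth_gib_per_s(2.0 * 1024 * 1024, 658.54)
      corresponds to the single 2MB THP line, and thp_speedup() with the
      658.54 us and 1987.38 us latencies corresponds to the quoted 3x.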
      
      This patch (of 10):
      
      Introduce a separate check routine related to MPOL_MF_INVERT flag.  This
      patch just does cleanup, no behavioral change.
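
      The patch body is not shown here, so the following is only a guess at
      the shape of such a check, written as standalone C.  The nodemask is
      modelled as a plain bitmask, SK_MPOL_MF_INVERT uses an illustrative
      value, and sk_queue_pages_required() is a hypothetical name.

```c
#include <stdbool.h>

#define SK_MPOL_MF_INVERT 0x10u /* illustrative value for the invert flag */

/* Decide whether a page on node `page_nid` should be queued.
 * Bit n of `nodemask` set means node n is in the mask; `page_nid` must be
 * smaller than the number of bits in unsigned long.
 * Without the invert flag, pages on nodes in the mask are queued;
 * with it, pages on nodes outside the mask are queued. */
static bool sk_queue_pages_required(int page_nid, unsigned long nodemask,
                                    unsigned int flags)
{
    bool in_mask = ((nodemask >> page_nid) & 1UL) != 0;
    bool invert = (flags & SK_MPOL_MF_INVERT) != 0;

    return in_mask != invert;
}
```

      Folding the inversion into one helper lets each caller ask a single
      yes/no question instead of repeating the mask-and-flag comparison.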
      
      Link: http://lkml.kernel.org/r/20170717193955.20207-2-zi.yan@sent.com
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      88aaa2a1