1. 12 Jul, 2019 3 commits
  2. 08 Jun, 2019 1 commit
  3. 21 May, 2019 1 commit
  4. 14 May, 2019 6 commits
  5. 28 Dec, 2018 1 commit
  6. 31 Oct, 2018 2 commits
  7. 21 Oct, 2018 1 commit
  8. 20 Sep, 2018 1 commit
    • Pasha Tatashin's avatar
      mm: disable deferred struct page for 32-bit arches · 889c695d
      Pasha Tatashin authored
      Deferred struct page init is needed only on systems with large amount of
      physical memory to improve boot performance.  32-bit systems do not
      benefit from this feature.
      
      Jiri reported a problem where deferred struct pages do not work well with
      x86-32:
      
      [    0.035162] Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
      [    0.035725] Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
      [    0.036269] Initializing CPU#0
      [    0.036513] Initializing HighMem for node 0 (00036ffe:0007ffe0)
      [    0.038459] page:f6780000 is uninitialized and poisoned
      [    0.038460] raw: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff
      [    0.039509] page dumped because: VM_BUG_ON_PAGE(1 && PageCompound(page))
      [    0.040038] ------------[ cut here ]------------
      [    0.040399] kernel BUG at include/linux/page-flags.h:293!
      [    0.040823] invalid opcode: 0000 [#1] SMP PTI
      [    0.041166] CPU: 0 PID: 0 Comm: swapper Not tainted 4.19.0-rc1_pt_jiri #9
      [    0.041694] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/2014
      [    0.042496] EIP: free_highmem_page+0x64/0x80
      [    0.042839] Code: 13 46 d8 c1 e8 18 5d 83 e0 03 8d 04 c0 c1 e0 06 ff 80 ec 5f 44 d8 c3 8d b4 26 00 00 00 00 ba 08 65 28 d8 89 d8 e8 fc 71 02 00 <0f> 0b 8d 76 00 8d bc 27 00 00 00 00 ba d0 b1 26 d8 89 d8 e8 e4 71
      [    0.044338] EAX: 0000003c EBX: f6780000 ECX: 00000000 EDX: d856cbe8
      [    0.044868] ESI: 0007ffe0 EDI: d838df20 EBP: d838df00 ESP: d838defc
      [    0.045372] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210086
      [    0.045913] CR0: 80050033 CR2: 00000000 CR3: 18556000 CR4: 00040690
      [    0.046413] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
      [    0.046913] DR6: fffe0ff0 DR7: 00000400
      [    0.047220] Call Trace:
      [    0.047419]  add_highpages_with_active_regions+0xbd/0x10d
      [    0.047854]  set_highmem_pages_init+0x5b/0x71
      [    0.048202]  mem_init+0x2b/0x1e8
      [    0.048460]  start_kernel+0x1d2/0x425
      [    0.048757]  i386_start_kernel+0x93/0x97
      [    0.049073]  startup_32_smp+0x164/0x168
      [    0.049379] Modules linked in:
      [    0.049626] ---[ end trace 337949378db0abbb ]---
      
      We free highmem pages before their struct pages are initialized:
      
      mem_init()
       set_highmem_pages_init()
        add_highpages_with_active_regions()
         free_highmem_page()
          .. Access uninitialized struct page here..
      
      Because there is no reason to have this feature on 32-bit systems, just
      disable it.
      
      Link: http://lkml.kernel.org/r/20180831150506.31246-1-pavel.tatashin@microsoft.com
      Fixes: 2e3ca40f
      
       ("mm: relax deferred struct page requirements")
      Signed-off-by: default avatarPavel Tatashin <pavel.tatashin@microsoft.com>
      Reported-by: default avatarJiri Slaby <jslaby@suse.cz>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      889c695d
  9. 17 Aug, 2018 3 commits
    • Huang Ying's avatar
      mm, swap: make CONFIG_THP_SWAP depend on CONFIG_SWAP · 14fef284
      Huang Ying authored
      CONFIG_THP_SWAP should depend on CONFIG_SWAP, because it's unreasonable
      to optimize swapping for THP (Transparent Huge Page) without basic
      swapping support.
      
      In original code, when CONFIG_SWAP=n and CONFIG_THP_SWAP=y,
      split_swap_cluster() will not be built because it is in swapfile.c, but
      it will be called in huge_memory.c.  This doesn't trigger a build error
      in practice because the call site is enclosed by PageSwapCache(), which
      is defined to be constant 0 when CONFIG_SWAP=n.  But this is fragile and
      should be fixed.
      
      The comments are fixed too to reflect the latest progress.
      
      Link: http://lkml.kernel.org/r/20180713021228.439-1-ying.huang@intel.com
      Fixes: 38d8b4e6
      
       ("mm, THP, swap: delay splitting THP during swap out")
      Signed-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: default avatarDan Williams <dan.j.williams@intel.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      14fef284
    • Pavel Tatashin's avatar
      mm/sparse: delete old sparse_init and enable new one · 2a3cb8ba
      Pavel Tatashin authored
      Rename new_sparse_init() to sparse_init() which enables it.  Delete old
      sparse_init() and all the code that became obsolete with.
      
      [pasha.tatashin@oracle.com: remove unused sparse_mem_maps_populate_node()]
        Link: http://lkml.kernel.org/r/20180716174447.14529-6-pasha.tatashin@oracle.com
      Link: http://lkml.kernel.org/r/20180712203730.8703-6-pasha.tatashin@oracle.com
      
      Signed-off-by: default avatarPavel Tatashin <pasha.tatashin@oracle.com>
      Tested-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
      Tested-by: default avatarOscar Salvador <osalvador@suse.de>
      Reviewed-by: default avatarOscar Salvador <osalvador@suse.de>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2a3cb8ba
    • Mike Rapoport's avatar
      mm: make DEFERRED_STRUCT_PAGE_INIT explicitly depend on SPARSEMEM · d39f8fb4
      Mike Rapoport authored
      The deferred memory initialization relies on section definitions, e.g
      PAGES_PER_SECTION, that are only available when CONFIG_SPARSEMEM=y on
      most architectures.
      
      Initially DEFERRED_STRUCT_PAGE_INIT depended on explicit
      ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT configuration option, but since
      the commit 2e3ca40f ("mm: relax deferred struct page
      requirements") this requirement was relaxed and now it is possible to
      enable DEFERRED_STRUCT_PAGE_INIT on architectures that support
      DISCONTINGMEM and NO_BOOTMEM which causes build failures.
      
      For instance, setting SMP=y and DEFERRED_STRUCT_PAGE_INIT=y on arc
      causes the following build failure:
      
          CC      mm/page_alloc.o
        mm/page_alloc.c: In function 'update_defer_init':
        mm/page_alloc.c:321:14: error: 'PAGES_PER_SECTION'
        undeclared (first use in this function); did you mean 'USEC_PER_SEC'?
              (pfn & (PAGES_PER_SECTION - 1)) == 0) {
                      ^~~~~~~~~~~~~~~~~
                      USEC_PER_SEC
        mm/page_alloc.c:321:14: note: each undeclared identifier is reported only once for each function it appears in
        In file included from include/linux/cache.h:5:0,
                         from include/linux/printk.h:9,
                         from include/linux/kernel.h:14,
                         from include/asm-generic/bug.h:18,
                         from arch/arc/include/asm/bug.h:32,
                         from include/linux/bug.h:5,
                         from include/linux/mmdebug.h:5,
                         from include/linux/mm.h:9,
                         from mm/page_alloc.c:18:
        mm/page_alloc.c: In function 'deferred_grow_zone':
        mm/page_alloc.c:1624:52: error: 'PAGES_PER_SECTION' undeclared (first use in this function); did you mean 'USEC_PER_SEC'?
          unsigned long nr_pages_needed = ALIGN(1 << order, PAGES_PER_SECTION);
                                                            ^
        include/uapi/linux/kernel.h:11:47: note: in definition of macro '__ALIGN_KERNEL_MASK'
         #define __ALIGN_KERNEL_MASK(x, mask) (((x) + (mask)) & ~(mask))
                                                       ^~~~
        include/linux/kernel.h:58:22: note: in expansion of macro '__ALIGN_KERNEL'
         #define ALIGN(x, a)  __ALIGN_KERNEL((x), (a))
                              ^~~~~~~~~~~~~~
        mm/page_alloc.c:1624:34: note: in expansion of macro 'ALIGN'
          unsigned long nr_pages_needed = ALIGN(1 << order, PAGES_PER_SECTION);
                                          ^~~~~
        In file included from include/asm-generic/bug.h:18:0,
                         from arch/arc/include/asm/bug.h:32,
                         from include/linux/bug.h:5,
                         from include/linux/mmdebug.h:5,
                         from include/linux/mm.h:9,
                         from mm/page_alloc.c:18:
        mm/page_alloc.c: In function 'free_area_init_node':
        mm/page_alloc.c:6379:50: error: 'PAGES_PER_SECTION' undeclared (first use in this function); did you mean 'USEC_PER_SEC'?
          pgdat->static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION,
                                                          ^
        include/linux/kernel.h:812:22: note: in definition of macro '__typecheck'
           (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
                              ^
        include/linux/kernel.h:836:24: note: in expansion of macro '__safe_cmp'
          __builtin_choose_expr(__safe_cmp(x, y), \
                                ^~~~~~~~~~
        include/linux/kernel.h:904:27: note: in expansion of macro '__careful_cmp'
         #define min_t(type, x, y) __careful_cmp((type)(x), (type)(y), <)
                                   ^~~~~~~~~~~~~
        mm/page_alloc.c:6379:29: note: in expansion of macro 'min_t'
          pgdat->static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION,
                                     ^~~~~
        include/linux/kernel.h:836:2: error: first argument to '__builtin_choose_expr' not a constant
          __builtin_choose_expr(__safe_cmp(x, y), \
          ^
        include/linux/kernel.h:904:27: note: in expansion of macro '__careful_cmp'
         #define min_t(type, x, y) __careful_cmp((type)(x), (type)(y), <)
                                   ^~~~~~~~~~~~~
        mm/page_alloc.c:6379:29: note: in expansion of macro 'min_t'
          pgdat->static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION,
                                     ^~~~~
        scripts/Makefile.build:317: recipe for target 'mm/page_alloc.o' failed
      
      Let's make the DEFERRED_STRUCT_PAGE_INIT explicitly depend on SPARSEMEM
      as the systems that support DISCONTIGMEM do not seem to have that huge
      amounts of memory that would make DEFERRED_STRUCT_PAGE_INIT relevant.
      
      Link: http://lkml.kernel.org/r/1530279308-24988-1-git-send-email-rppt@linux.vnet.ibm.com
      
      Signed-off-by: default avatarMike Rapoport <rppt@linux.vnet.ibm.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarPavel Tatashin <pasha.tatashin@oracle.com>
      Tested-by: default avatarRandy Dunlap <rdunlap@infradead.org>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d39f8fb4
  10. 01 Aug, 2018 1 commit
  11. 08 Jun, 2018 1 commit
    • Laurent Dufour's avatar
      mm: introduce ARCH_HAS_PTE_SPECIAL · 3010a5ea
      Laurent Dufour authored
      Currently the PTE special supports is turned on in per architecture
      header files.  Most of the time, it is defined in
      arch/*/include/asm/pgtable.h depending or not on some other per
      architecture static definition.
      
      This patch introduce a new configuration variable to manage this
      directly in the Kconfig files.  It would later replace
      __HAVE_ARCH_PTE_SPECIAL.
      
      Here notes for some architecture where the definition of
      __HAVE_ARCH_PTE_SPECIAL is not obvious:
      
      arm
       __HAVE_ARCH_PTE_SPECIAL which is currently defined in
      arch/arm/include/asm/pgtable-3level.h which is included by
      arch/arm/include/asm/pgtable.h when CONFIG_ARM_LPAE is set.
      So select ARCH_HAS_PTE_SPECIAL if ARM_LPAE.
      
      powerpc
      __HAVE_ARCH_PTE_SPECIAL is defined in 2 files:
       - arch/powerpc/include/asm/book3s/64/pgtable.h
       - arch/powerpc/include/asm/pte-common.h
      The first one is included if (PPC_BOOK3S & PPC64) while the second is
      included in all the other cases.
      So select ARCH_HAS_PTE_SPECIAL all the time.
      
      sparc:
      __HAVE_ARCH_PTE_SPECIAL is defined if defined(__sparc__) &&
      defined(__arch64__) which are defined through the compiler in
      sparc/Makefile if !SPARC32 which I assume to be if SPARC64.
      So select ARCH_HAS_PTE_SPECIAL if SPARC64
      
      There is no functional change introduced by this patch.
      
      Link: http://lkml.kernel.org/r/1523433816-14460-2-git-send-email-ldufour@linux.vnet.ibm.com
      
      Signed-off-by: default avatarLaurent Dufour <ldufour@linux.vnet.ibm.com>
      Suggested-by: default avatarJerome Glisse <jglisse@redhat.com>
      Reviewed-by: default avatarJerome Glisse <jglisse@redhat.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Rich Felker <dalias@libc.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Palmer Dabbelt <palmer@sifive.com>
      Cc: Albert Ou <albert@sifive.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Christophe LEROY <christophe.leroy@c-s.fr>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3010a5ea
  12. 22 May, 2018 1 commit
    • Dan Williams's avatar
      mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS · e7638488
      Dan Williams authored
      
      
      In preparation for fixing dax-dma-vs-unmap issues, filesystems need to
      be able to rely on the fact that they will get wakeups on dev_pagemap
      page-idle events. Introduce MEMORY_DEVICE_FS_DAX and
      generic_dax_page_free() as common indicator / infrastructure for dax
      filesytems to require. With this change there are no users of the
      MEMORY_DEVICE_HOST designation, so remove it.
      
      The HMM sub-system extended dev_pagemap to arrange a callback when a
      dev_pagemap managed page is freed. Since a dev_pagemap page is free /
      idle when its reference count is 1 it requires an additional branch to
      check the page-type at put_page() time. Given put_page() is a hot-path
      we do not want to incur that check if HMM is not in use, so a static
      branch is used to avoid that overhead when not necessary.
      
      Now, the FS_DAX implementation wants to reuse this mechanism for
      receiving dev_pagemap ->page_free() callbacks. Rework the HMM-specific
      static-key into a generic mechanism that either HMM or FS_DAX code paths
      can enable.
      
      For ARCH=um builds, and any other arch that lacks ZONE_DEVICE support,
      care must be taken to compile out the DEV_PAGEMAP_OPS infrastructure.
      However, we still need to support FS_DAX in the FS_DAX_LIMITED case
      implemented by the s390/dcssblk driver.
      
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Reported-by: default avatarkbuild test robot <lkp@intel.com>
      Reported-by: default avatarThomas Meyer <thomas@m3y3r.de>
      Reported-by: default avatarDave Jiang <dave.jiang@intel.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      e7638488
  13. 19 May, 2018 1 commit
  14. 09 May, 2018 1 commit
  15. 27 Apr, 2018 1 commit
  16. 16 Apr, 2018 1 commit
  17. 26 Mar, 2018 1 commit
  18. 23 Feb, 2018 1 commit
  19. 01 Feb, 2018 1 commit
  20. 18 Nov, 2017 1 commit
  21. 09 Sep, 2017 6 commits
  22. 07 Sep, 2017 1 commit
  23. 10 Jul, 2017 1 commit
  24. 06 Jul, 2017 2 commits
    • Michal Hocko's avatar
      mm, memory_hotplug: drop CONFIG_MOVABLE_NODE · f70029bb
      Michal Hocko authored
      Commit 20b2f52b ("numa: add CONFIG_MOVABLE_NODE for
      movable-dedicated node") has introduced CONFIG_MOVABLE_NODE without a
      good explanation on why it is actually useful.
      
      It makes a lot of sense to make movable node semantic opt in but we
      already have that because the feature has to be explicitly enabled on
      the kernel command line.  A config option on top only makes the
      configuration space larger without a good reason.  It also adds an
      additional ifdefery that pollutes the code.
      
      Just drop the config option and make it de-facto always enabled.  This
      shouldn't introduce any change to the semantic.
      
      Link: http://lkml.kernel.org/r/20170529114141.536-3-mhocko@kernel.org
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarReza Arbab <arbab@linux.vnet.ibm.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Kani Toshimitsu <toshi.kani@hpe.com>
      Cc: Chen Yucong <slaoub@gmail.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f70029bb
    • Huang Ying's avatar
      mm, THP, swap: delay splitting THP during swap out · 38d8b4e6
      Huang Ying authored
      Patch series "THP swap: Delay splitting THP during swapping out", v11.
      
      This patchset is to optimize the performance of Transparent Huge Page
      (THP) swap.
      
      Recently, the performance of the storage devices improved so fast that
      we cannot saturate the disk bandwidth with single logical CPU when do
      page swap out even on a high-end server machine.  Because the
      performance of the storage device improved faster than that of single
      logical CPU.  And it seems that the trend will not change in the near
      future.  On the other hand, the THP becomes more and more popular
      because of increased memory size.  So it becomes necessary to optimize
      THP swap performance.
      
      The advantages of the THP swap support include:
      
       - Batch the swap operations for the THP to reduce lock
         acquiring/releasing, including allocating/freeing the swap space,
         adding/deleting to/from the swap cache, and writing/reading the swap
         space, etc. This will help improve the performance of the THP swap.
      
       - The THP swap space read/write will be 2M sequential IO. It is
         particularly helpful for the swap read, which are usually 4k random
         IO. This will improve the performance of the THP swap too.
      
       - It will help the memory fragmentation, especially when the THP is
         heavily used by the applications. The 2M continuous pages will be
         free up after THP swapping out.
      
       - It will improve the THP utilization on the system with the swap
         turned on. Because the speed for khugepaged to collapse the normal
         pages into the THP is quite slow. After the THP is split during the
         swapping out, it will take quite long time for the normal pages to
         collapse back into the THP after being swapped in. The high THP
         utilization helps the efficiency of the page based memory management
         too.
      
      There are some concerns regarding THP swap in, mainly because possible
      enlarged read/write IO size (for swap in/out) may put more overhead on
      the storage device.  To deal with that, the THP swap in should be turned
      on only when necessary.  For example, it can be selected via
      "always/never/madvise" logic, to be turned on globally, turned off
      globally, or turned on only for VMA with MADV_HUGEPAGE, etc.
      
      This patchset is the first step for the THP swap support.  The plan is
      to delay splitting THP step by step, finally avoid splitting THP during
      the THP swapping out and swap out/in the THP as a whole.
      
      As the first step, in this patchset, the splitting huge page is delayed
      from almost the first step of swapping out to after allocating the swap
      space for the THP and adding the THP into the swap cache.  This will
      reduce lock acquiring/releasing for the locks used for the swap cache
      management.
      
      With the patchset, the swap out throughput improves 15.5% (from about
      3.73GB/s to about 4.31GB/s) in the vm-scalability swap-w-seq test case
      with 8 processes.  The test is done on a Xeon E5 v3 system.  The swap
      device used is a RAM simulated PMEM (persistent memory) device.  To test
      the sequential swapping out, the test case creates 8 processes, which
      sequentially allocate and write to the anonymous pages until the RAM and
      part of the swap device is used up.
      
      This patch (of 5):
      
      In this patch, splitting huge page is delayed from almost the first step
      of swapping out to after allocating the swap space for the THP
      (Transparent Huge Page) and adding the THP into the swap cache.  This
      will batch the corresponding operation, thus improve THP swap out
      throughput.
      
      This is the first step for the THP swap optimization.  The plan is to
      delay splitting the THP step by step and avoid splitting the THP
      finally.
      
      In this patch, one swap cluster is used to hold the contents of each THP
      swapped out.  So, the size of the swap cluster is changed to that of the
      THP (Transparent Huge Page) on x86_64 architecture (512).  For other
      architectures which want such THP swap optimization,
      ARCH_USES_THP_SWAP_CLUSTER needs to be selected in the Kconfig file for
      the architecture.  In effect, this will enlarge swap cluster size by 2
      times on x86_64.  Which may make it harder to find a free cluster when
      the swap space becomes fragmented.  So that, this may reduce the
      continuous swap space allocation and sequential write in theory.  The
      performance test in 0day shows no regressions caused by this.
      
      In the future of THP swap optimization, some information of the swapped
      out THP (such as compound map count) will be recorded in the
      swap_cluster_info data structure.
      
      The mem cgroup swap accounting functions are enhanced to support charge
      or uncharge a swap cluster backing a THP as a whole.
      
      The swap cluster allocate/free functions are added to allocate/free a
      swap cluster for a THP.  A fair simple algorithm is used for swap
      cluster allocation, that is, only the first swap device in priority list
      will be tried to allocate the swap cluster.  The function will fail if
      the trying is not successful, and the caller will fallback to allocate a
      single swap slot instead.  This works good enough for normal cases.  If
      the difference of the number of the free swap clusters among multiple
      swap devices is significant, it is possible that some THPs are split
      earlier than necessary.  For example, this could be caused by big size
      difference among multiple swap devices.
      
      The swap cache functions is enhanced to support add/delete THP to/from
      the swap cache as a set of (HPAGE_PMD_NR) sub-pages.  This may be
      enhanced in the future with multi-order radix tree.  But because we will
      split the THP soon during swapping out, that optimization doesn't make
      much sense for this first step.
      
      The THP splitting functions are enhanced to support to split THP in swap
      cache during swapping out.  The page lock will be held during allocating
      the swap cluster, adding the THP into the swap cache and splitting the
      THP.  So in the code path other than swapping out, if the THP need to be
      split, the PageSwapCache(THP) will be always false.
      
      The swap cluster is only available for SSD, so the THP swap optimization
      in this patchset has no effect for HDD.
      
      [ying.huang@intel.com: fix two issues in THP optimize patch]
        Link: http://lkml.kernel.org/r/87k25ed8zo.fsf@yhuang-dev.intel.com
      [hannes@cmpxchg.org: extensive cleanups and simplifications, reduce code size]
      Link: http://lkml.kernel.org/r/20170515112522.32457-2-ying.huang@intel.com
      
      Signed-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Suggested-by: Andrew Morton <akpm@linux-foundation.org> [for config option]
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> [for changes in huge_memory.c and huge_mm.h]
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      38d8b4e6