1. 10 Jul, 2017 6 commits
    • mm/memory_hotplug.c: remove unused local zone_type from __remove_zone() · a52149f1
      John Hubbard authored
      __remove_zone() sets up zone_type, but never uses it for anything.
      This does not cause a warning, due to the (necessary) use of
      -Wno-unused-but-set-variable.  However, it's noise, so just delete it.
      
      Link: http://lkml.kernel.org/r/20170624043421.24465-2-jhubbard@nvidia.com
      
      
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: unify new_node_page and alloc_migrate_target · 8b913238
      Michal Hocko authored
      Commit 394e31d2 ("mem-hotplug: alloc new page from a nearest
      neighbor node when mem-offline") has duplicated a large part of
      alloc_migrate_target with some hotplug specific special casing.
      
      To be more precise it tried to enforce the allocation from a different
      node than the original page.  As a result the two functions diverged in
      their shared logic, e.g.  the hugetlb allocation strategy.
      
      Let's unify the two and express different NUMA requirements by the given
      nodemask.  new_node_page will simply exclude the node it doesn't care
      about and alloc_migrate_target will use all the available nodes.
      alloc_migrate_target will then learn to migrate hugetlb pages more
      sanely and use preallocated pool when possible.
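      
      The unification described above can be pictured with a small user-space
      sketch.  It is not the kernel code: the nodemask_t typedef, MAX_NODES and
      every function name below are hypothetical stand-ins, and a plain bitmask
      replaces the real nodemask API.  It only shows both callers sharing one
      allocation path and differing in the mask they pass.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_NODES 8

typedef uint32_t nodemask_t;	/* hypothetical: one bit per NUMA node */

/* Nodes that currently have memory (assumed fixed for this sketch). */
static const nodemask_t nodes_with_memory = (1u << MAX_NODES) - 1;

/* alloc_migrate_target-style caller: every node is acceptable. */
static nodemask_t migrate_target_mask(void)
{
	return nodes_with_memory;
}

/* new_node_page-style caller: anything but the source node. */
static nodemask_t hotplug_target_mask(int src_nid)
{
	assert(src_nid >= 0 && src_nid < MAX_NODES);
	return nodes_with_memory & ~(1u << src_nid);
}

/* Shared allocator stand-in: lowest allowed node, or -1 if none. */
static int pick_node(nodemask_t allowed)
{
	for (int nid = 0; nid < MAX_NODES; nid++)
		if (allowed & (1u << nid))
			return nid;
	return -1;
}
```

      Both callers end up in pick_node(); only the mask they build differs.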
      
      Please note that alloc_migrate_target used to call alloc_page resp.
      alloc_pages_current, so it followed the memory policy of the current
      context.  This is quite strange when we consider that it is used in the
      context of alloc_contig_range which just tries to migrate pages which
      stand in the way.
      
      Link: http://lkml.kernel.org/r/20170608074553.22152-4-mhocko@kernel.org
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: zhong jiang <zhongjiang@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb, memory_hotplug: prefer to use reserved pages for migration · 4db9b2ef
      Michal Hocko authored
      new_node_page will try to use the origin's next NUMA node as the
      migration destination for hugetlb pages.  If such a node doesn't have
      any preallocated pool it falls back to __alloc_buddy_huge_page_no_mpol
      to allocate a surplus page instead.  This is quite suboptimal for any
      configuration where hugetlb pages are not distributed to all NUMA nodes
      evenly.  Say we have a hotpluggable node 4 and spare hugetlb pages are
      on node 0:
      
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:10000
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node3/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node4/hugepages/hugepages-2048kB/nr_hugepages:10000
        /sys/devices/system/node/node5/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node6/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node7/hugepages/hugepages-2048kB/nr_hugepages:0
      
      Now we consume the whole pool on node 4 and try to offline this node.
      All the allocated pages should be moved to node0 which has enough
      preallocated pages to hold them.  With the current implementation
      offlining very likely fails because hugetlb allocations during runtime
      are much less reliable.
      
      Fix this by reusing the nodemask which excludes the migration source:
      first try to find a node which has a page in the preallocated pool and
      fall back to __alloc_buddy_huge_page_no_mpol only when the whole pool is
      consumed.
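      
      A minimal user-space sketch of that preference order follows.  The
      free_huge_pages array, set_pool() and alloc_huge_page_from_mask() are
      hypothetical names for illustration only, and the bitmask stands in for
      the kernel nodemask; the real allocation path is more involved.

```c
#include <assert.h>

#define MAX_NODES 8

/* Hypothetical per-node count of free preallocated huge pages. */
static long free_huge_pages[MAX_NODES];

enum alloc_source { FROM_POOL, FROM_BUDDY_FALLBACK };

struct alloc_result {
	enum alloc_source source;
	int nid;		/* node the page came from, -1 on fallback */
};

static void set_pool(int nid, long nr_free)
{
	assert(nid >= 0 && nid < MAX_NODES && nr_free >= 0);
	free_huge_pages[nid] = nr_free;
}

/* Take a pool page from the first allowed node that has one. */
static struct alloc_result alloc_huge_page_from_mask(unsigned int allowed)
{
	struct alloc_result res = { FROM_BUDDY_FALLBACK, -1 };

	for (int nid = 0; nid < MAX_NODES; nid++) {
		if (!(allowed & (1u << nid)))
			continue;
		if (free_huge_pages[nid] > 0) {
			free_huge_pages[nid]--;
			res.source = FROM_POOL;
			res.nid = nid;
			return res;
		}
	}
	/* Whole pool consumed: the caller would try a surplus page. */
	return res;
}
```

      With node 4 excluded from the mask and spare pages only on node 0, every
      request is served from node 0 until its pool runs dry.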
      
      [akpm@linux-foundation.org: remove bogus arg from alloc_huge_page_nodemask() stub]
      Link: http://lkml.kernel.org/r/20170608074553.22152-3-mhocko@kernel.org
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: zhong jiang <zhongjiang@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memory_hotplug: simplify empty node mask handling in new_node_page · 7f252f27
      Michal Hocko authored
      new_node_page tries to allocate the target page on a different NUMA node
      than the source page.  This makes sense in most cases during the hotplug
      because we are likely to offline the whole NUMA node.  But there are
      cases where there are no other nodes to fall back to (e.g.  when offlining
      parts of the only existing node) and we have to fall back to allocating
      from the source node.  The current code does that but it can be
      simplified by checking the nmask and updating it before we even try to
      allocate rather than special casing it.
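      
      The "check the mask first" idea can be sketched as below.  This is an
      illustration with a hypothetical target_nodemask() helper and a plain
      bitmask, not the function from the patch.

```c
#include <assert.h>

#define MAX_NODES 8

/* Build the allowed-node mask up front instead of special-casing later. */
static unsigned int target_nodemask(unsigned int nodes_with_memory,
				    int src_nid)
{
	unsigned int nmask;

	assert(src_nid >= 0 && src_nid < MAX_NODES);

	/* Prefer any node other than the one being offlined. */
	nmask = nodes_with_memory & ~(1u << src_nid);

	/* No other node to fall back to: allow the source node again. */
	if (nmask == 0)
		nmask = 1u << src_nid;

	return nmask;
}
```

      The allocation that follows can then use the mask unconditionally.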
      
      This patch shouldn't introduce any functional change.
      
      Link: http://lkml.kernel.org/r/20170608074553.22152-2-mhocko@kernel.org
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: zhong jiang <zhongjiang@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memory_hotplug: support movable_node for hotpluggable nodes · 9f123ab5
      Michal Hocko authored
      The movable_node kernel parameter makes hotpluggable NUMA nodes put all
      their hotpluggable memory into the movable zone, which allows more or
      less reliable memory hotremove.  At least this is the case for the NUMA
      nodes present during the boot (see find_zone_movable_pfns_for_nodes).
      
      This is not the case for the memory hotplug, though.
      
      	echo online > /sys/devices/system/memory/memoryXYZ/state
      
      will default to a kernel zone (usually ZONE_NORMAL) unless the
      particular memblock is already in the movable zone range which is not
      the case normally when onlining the memory from the udev rule context
      for a freshly hotadded NUMA node.  The only option currently is to have
      a special udev rule to echo online_movable to all memblocks belonging to
      such a node which is rather clumsy.  Not to mention this is inconsistent
      as well because what ended up in the movable zone during the boot will
      end up in a kernel zone after hotremove & hotadd without special care.
      
      It would be nice to reuse memblock_is_hotpluggable but the runtime
      hotplug doesn't have that information available because the boot and
      hotplug paths are not shared and it would be really non trivial to make
      them use the same code path because the runtime hotplug doesn't play
      with the memblock allocator at all.
      
      Teach move_pfn_range that MMOP_ONLINE_KEEP can use the movable zone if
      movable_node is enabled and the range doesn't overlap with the existing
      normal zone.  This should provide a reasonable default onlining
      strategy.
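      
      The default described above can be sketched in user space as follows.
      The pfn_range type and default_zone_for_keep() are hypothetical names
      chosen for illustration; the sketch only encodes the two conditions named
      in the text (movable_node enabled, no overlap with the normal zone).

```c
#include <assert.h>
#include <stdbool.h>

enum zone_choice { ZONE_KERNEL, ZONE_MOVABLE_CHOICE };

struct pfn_range {
	unsigned long start;	/* inclusive */
	unsigned long end;	/* exclusive */
};

static bool ranges_overlap(struct pfn_range a, struct pfn_range b)
{
	return a.start < b.end && b.start < a.end;
}

/* Zone picked for a plain "online" request in this sketch. */
static enum zone_choice default_zone_for_keep(bool movable_node_enabled,
					      struct pfn_range normal_zone,
					      struct pfn_range onlined)
{
	assert(onlined.start < onlined.end);

	if (movable_node_enabled && !ranges_overlap(normal_zone, onlined))
		return ZONE_MOVABLE_CHOICE;
	return ZONE_KERNEL;
}
```

      A freshly hotadded node has an empty normal zone, so with movable_node
      enabled its blocks default to the movable zone in this model.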
      
      Strictly speaking the semantic is not identical with the boot time
      initialization because find_zone_movable_pfns_for_nodes covers only the
      hotpluggable range as described by the BIOS/FW.  From my experience this
      is usually a full node though (except for Node0 which is special and
      never goes away completely).  If this turns out to be a problem in the
      real life we can tweak the code to store hotplug flag into memblocks but
      let's keep this simple now.
      
      Link: http://lkml.kernel.org/r/20170612111227.GI7476@dhcp22.suse.cz
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
      Cc: <qiuxishi@huawei.com>
      Cc: Kani Toshimitsu <toshi.kani@hpe.com>
      Cc: <slaoub@gmail.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug.c: add NULL check to avoid potential NULL pointer dereference · dbac61a3
      Gustavo A. R. Silva authored
      The NULL check at line 1226: if (!pgdat), implies that pointer pgdat
      might be NULL.
      
      rollback_node_hotadd() dereferences this pointer.  Add NULL check to
      avoid a potential NULL pointer dereference.
      
      Addresses-Coverity-ID: 1369133
      Link: http://lkml.kernel.org/r/20170530212436.GA6195@embeddedgus
      
      
      Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 06 Jul, 2017 15 commits
    • mm, memory_hotplug: move movable_node to the hotplug proper · 4932381e
      Michal Hocko authored
      movable_node_is_enabled is defined in memblock proper while it is
      initialized from the memory hotplug proper.  This is quite messy and it
      makes a dependency between the two so move movable_node along with the
      helper functions to memory_hotplug.
      
      To make it more entertaining the kernel parameter is ignored unless
      CONFIG_HAVE_MEMBLOCK_NODE_MAP=y because we do not have the node
      information for each memblock otherwise.  So let's warn when the option
      is disabled.
      
      Link: http://lkml.kernel.org/r/20170529114141.536-4-mhocko@kernel.org
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Kani Toshimitsu <toshi.kani@hpe.com>
      Cc: Chen Yucong <slaoub@gmail.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memory_hotplug: drop CONFIG_MOVABLE_NODE · f70029bb
      Michal Hocko authored
      Commit 20b2f52b ("numa: add CONFIG_MOVABLE_NODE for
      movable-dedicated node") has introduced CONFIG_MOVABLE_NODE without a
      good explanation on why it is actually useful.
      
      It makes a lot of sense to make movable node semantic opt in but we
      already have that because the feature has to be explicitly enabled on
      the kernel command line.  A config option on top only makes the
      configuration space larger without a good reason.  It also adds an
      additional ifdefery that pollutes the code.
      
      Just drop the config option and make it de-facto always enabled.  This
      shouldn't introduce any change to the semantic.
      
      Link: http://lkml.kernel.org/r/20170529114141.536-3-mhocko@kernel.org
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Kani Toshimitsu <toshi.kani@hpe.com>
      Cc: Chen Yucong <slaoub@gmail.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memory_hotplug: drop artificial restriction on online/offline · 57c0a172
      Michal Hocko authored
      Patch series "remove CONFIG_MOVABLE_NODE".
      
      I am continuing to clean up the memory hotplug code and
      CONFIG_MOVABLE_NODE seems dubious at best.  The following two patches
      simply remove the flag and make it de-facto always enabled.
      
      The current semantic of the config option is twofold: 1) it automatically
      binds hotpluggable nodes to have memory in zone_movable by default when
      movable_node is enabled; 2) it forbids memory hotplug to online all the
      memory as movable when !CONFIG_MOVABLE_NODE.
      
      The latter restriction is quite dubious because there is no clear cut of
      how much normal memory we need for a reasonable system operation.  A
      single memory block which is sufficient to allow further movable onlines
      is far from sufficient (e.g. a node with >2GB and memblocks 128MB will
      fill up this zone with struct pages leaving nothing for other
      allocations).  Removing the config option will not only reduce the
      configuration space, it also removes quite some code.
      
      The semantic of the movable_node command line parameter is preserved.
      
      The first patch removes the restriction mentioned above and the second
      one simply removes all the CONFIG_MOVABLE_NODE related stuff.  The last
      patch moves movable_node flag handling to memory_hotplug proper where it
      belongs.
      
      [1] http://lkml.kernel.org/r/20170524122411.25212-1-mhocko@kernel.org
      
      This patch (of 3):
      
      Commit 74d42d8f ("memory_hotplug: ensure every online node has
      NORMAL memory") has introduced a restriction that every numa node has to
      have at least some memory in !movable zones before a first movable
      memory can be onlined if !CONFIG_MOVABLE_NODE.
      
      Likewise can_offline_normal checks the amount of normal memory in
      !movable zones and it disallows to offline memory if there is no normal
      memory left with a justification that "memory-management acts bad when
      we have nodes which is online but don't have any normal memory".
      
      While it is true that not having _any_ memory for kernel allocations on
      a NUMA node is far from great and such a node would be quite suboptimal
      because all kernel allocations will have to fall back to another NUMA
      node, there is no reason to disallow such a configuration in
      principle.
      
      Besides that there is not really a big difference between having one
      memblock for ZONE_NORMAL available or none.  With 128MB size memblocks the
      system might thrash on the kernel allocation requests anyway.  It is really
      hard to draw a line on how much normal memory is really sufficient so we
      have to rely on the administrator to configure the system sanely.  Therefore
      drop the artificial restriction and remove can_offline_normal and
      can_online_high_movable altogether.
      
      Link: http://lkml.kernel.org/r/20170529114141.536-2-mhocko@kernel.org
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Kani Toshimitsu <toshi.kani@hpe.com>
      Cc: Chen Yucong <slaoub@gmail.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_alloc: pass preferred nid instead of zonelist to allocator · 04ec6264
      Vlastimil Babka authored
      The main allocator function __alloc_pages_nodemask() takes a zonelist
      pointer as one of its parameters.  All of its callers directly or
      indirectly obtain the zonelist via node_zonelist() using a preferred
      node id and gfp_mask.  We can make the code a bit simpler by doing the
      zonelist lookup in __alloc_pages_nodemask(), passing it a preferred node
      id instead (gfp_mask is already another parameter).
      
      There are some code size benefits thanks to removal of inlined
      node_zonelist():
      
        bloat-o-meter add/remove: 2/2 grow/shrink: 4/36 up/down: 399/-1351 (-952)
      
      This will also make things simpler if we proceed with converting cpusets
      to zonelists.
      
      Link: http://lkml.kernel.org/r/20170517081140.30654-4-vbabka@suse.cz
      
      
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memory_hotplug: remove unused cruft after memory hotplug rework · 559bfc7d
      Michal Hocko authored
      zone_for_memory doesn't have any user anymore as well as the whole zone
      shifting infrastructure so drop them all.
      
      This shouldn't introduce any functional changes.
      
      Link: http://lkml.kernel.org/r/20170515085827.16474-15-mhocko@kernel.org
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memory_hotplug: fix the section mismatch warning · cdf72f25
      Michal Hocko authored
      Tobias has reported the following section mismatches introduced by "mm,
      memory_hotplug: do not associate hotadded memory to zones until online".
      
        WARNING: mm/built-in.o(.text+0x5a1c2): Section mismatch in reference from the function move_pfn_range_to_zone() to the function .meminit.text:memmap_init_zone()
        The function move_pfn_range_to_zone() references
        the function __meminit memmap_init_zone().
        This is often because move_pfn_range_to_zone lacks a __meminit
        annotation or the annotation of memmap_init_zone is wrong.
      
        WARNING: mm/built-in.o(.text+0x5a25b): Section mismatch in reference from the function move_pfn_range_to_zone() to the function .meminit.text:init_currently_empty_zone()
        The function move_pfn_range_to_zone() references
        the function __meminit init_currently_empty_zone().
        This is often because move_pfn_range_to_zone lacks a __meminit
        annotation or the annotation of init_currently_empty_zone is wrong.
      
        WARNING: vmlinux.o(.text+0x188aa2): Section mismatch in reference from the function move_pfn_range_to_zone() to the function .meminit.text:memmap_init_zone()
        The function move_pfn_range_to_zone() references
        the function __meminit memmap_init_zone().
        This is often because move_pfn_range_to_zone lacks a __meminit
        annotation or the annotation of memmap_init_zone is wrong.
      
        WARNING: vmlinux.o(.text+0x188b3b): Section mismatch in reference from the function move_pfn_range_to_zone() to the function .meminit.text:init_currently_empty_zone()
        The function move_pfn_range_to_zone() references
        the function __meminit init_currently_empty_zone().
        This is often because move_pfn_range_to_zone lacks a __meminit
        annotation or the annotation of init_currently_empty_zone is wrong.
      
      Both memmap_init_zone and init_currently_empty_zone are marked __meminit
      but move_pfn_range_to_zone is used outside of __meminit sections (e.g.
      devm_memremap_pages) so we have to hide it from the checker by __ref
      annotation.
      
      Link: http://lkml.kernel.org/r/20170515085827.16474-14-mhocko@kernel.org
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memory_hotplug: replace for_device by want_memblock in arch_add_memory · 3d79a728
      Michal Hocko authored
      arch_add_memory gets for_device argument which then controls whether we
      want to create memblocks for created memory sections.  Simplify the
      logic by telling whether we want memblocks directly rather than going
      through pointless negation.  This also makes the API easier to
      understand because it is clear what we want, unlike the uninformative
      for_device which can mean anything.
      
      This shouldn't introduce any functional change.
      
      Link: http://lkml.kernel.org/r/20170515085827.16474-13-mhocko@kernel.org
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Tested-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memory_hotplug: do not assume ZONE_NORMAL is default kernel zone · c246a213
      Michal Hocko authored
      Heiko Carstens has noticed that he can generate overlapping zones for
      ZONE_DMA and ZONE_NORMAL:
      
        DMA      [mem 0x0000000000000000-0x000000007fffffff]
        Normal   [mem 0x0000000080000000-0x000000017fffffff]
      
        $ cat /sys/devices/system/memory/block_size_bytes
        10000000
        $ cat /sys/devices/system/memory/memory5/valid_zones
        DMA
        $ echo 0 > /sys/devices/system/memory/memory5/online
        $ cat /sys/devices/system/memory/memory5/valid_zones
        Normal
        $ echo 1 > /sys/devices/system/memory/memory5/online
        Normal
      
        $ cat /proc/zoneinfo
        Node 0, zone      DMA
        spanned  524288        <-----
        present  458752
        managed  455078
        start_pfn:           0 <-----
      
        Node 0, zone   Normal
        spanned  720896
        present  589824
        managed  571648
        start_pfn:           327680 <-----
      
      The reason is that we assume that the default zone for kernel onlining
      is ZONE_NORMAL.  This was a simplification introduced by the memory
      hotplug rework and it is easily fixable by checking the range overlap in
      the zone order and considering the first matching zone as the default
      one.  If there is no such zone then assume ZONE_NORMAL as we have been
      doing so far.
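      
      The zone-order lookup can be sketched in user space as follows.  The
      zone_span type, the Z_DMA/Z_NORMAL enumerators and default_kernel_zone()
      are hypothetical names, and the sketch assumes only two kernel zones; it
      shows the "first intersecting zone wins, else Normal" rule and nothing
      more.

```c
#include <assert.h>

enum zone_id { Z_DMA, Z_NORMAL, NR_KERNEL_ZONES };

struct zone_span {
	unsigned long start_pfn;
	unsigned long spanned_pages;
};

/* Does [start_pfn, start_pfn + nr_pages) intersect the zone span? */
static int zone_intersects(const struct zone_span *z,
			   unsigned long start_pfn, unsigned long nr_pages)
{
	if (z->spanned_pages == 0)
		return 0;
	return start_pfn < z->start_pfn + z->spanned_pages &&
	       z->start_pfn < start_pfn + nr_pages;
}

/* Walk kernel zones in order; the first intersecting zone is the default. */
static enum zone_id default_kernel_zone(const struct zone_span zones[NR_KERNEL_ZONES],
					unsigned long start_pfn,
					unsigned long nr_pages)
{
	assert(nr_pages > 0);

	for (int zid = 0; zid < NR_KERNEL_ZONES; zid++)
		if (zone_intersects(&zones[zid], start_pfn, nr_pages))
			return (enum zone_id)zid;

	/* No zone covers the range: keep the old ZONE_NORMAL assumption. */
	return Z_NORMAL;
}
```

      Assuming 4 KiB pages, the memory5 block from the report starts at pfn
      327680 and spans 65536 pages, which lies inside the DMA range, so this
      lookup picks DMA rather than Normal.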
      
      Fixes: "mm, memory_hotplug: do not associate hotadded memory to zones until online"
      Link: http://lkml.kernel.org/r/20170601083746.4924-3-mhocko@kernel.org
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memory_hotplug: fix MMOP_ONLINE_KEEP behavior · a69578a1
      Michal Hocko authored
      Heiko Carstens has noticed that MMOP_ONLINE_KEEP is currently broken:
      
        $ grep . memory3?/valid_zones
        memory34/valid_zones:Normal Movable
        memory35/valid_zones:Normal Movable
        memory36/valid_zones:Normal Movable
        memory37/valid_zones:Normal Movable
      
        $ echo online_movable > memory34/state
        $ grep . memory3?/valid_zones
        memory34/valid_zones:Movable
        memory35/valid_zones:Movable
        memory36/valid_zones:Movable
        memory37/valid_zones:Movable
      
        $ echo online > memory36/state
        $ grep . memory3?/valid_zones
        memory34/valid_zones:Movable
        memory36/valid_zones:Normal
        memory37/valid_zones:Movable
      
      so we have effectively punched a hole into the movable zone.
      
      The problem is that the move_pfn_range() check for MMOP_ONLINE_KEEP is
      wrong.  It only checks whether the given range is already part of the
      movable zone which is not the case here as only memory34 is in the zone.
      Fix this by using allow_online_pfn_range(..., MMOP_ONLINE_KERNEL); if
      that is false then we can be sure that movable onlining is the right
      thing to do.
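      
      A rough user-space model of the fixed decision is given below.  The rule
      inside allow_kernel_online() (a kernel online is allowed only when the
      range lies entirely below a non-empty movable zone) is an assumption made
      for this sketch, and all names are hypothetical rather than taken from
      the patch.

```c
#include <assert.h>
#include <stdbool.h>

struct zone_span {
	unsigned long start_pfn;
	unsigned long spanned_pages;
};

/*
 * Assumed rule: kernel onlining is fine if the movable zone is empty
 * or the whole range sits below the start of the movable zone.
 */
static bool allow_kernel_online(const struct zone_span *movable,
				unsigned long start_pfn,
				unsigned long nr_pages)
{
	if (movable->spanned_pages == 0)
		return true;
	return movable->start_pfn >= start_pfn + nr_pages;
}

/* True when a plain "online" (keep) request should go to the movable zone. */
static bool keep_resolves_to_movable(const struct zone_span *movable,
				     unsigned long start_pfn,
				     unsigned long nr_pages)
{
	assert(nr_pages > 0);
	return !allow_kernel_online(movable, start_pfn, nr_pages);
}
```

      With 128MB blocks of 32768 pages and memory34 already movable, memory36
      resolves to movable in this model while memory33 can still be onlined as
      kernel memory.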
      
      Fixes: "mm, memory_hotplug: do not associate hotadded memory to zones until online"
      Link: http://lkml.kernel.org/r/20170601083746.4924-2-mhocko@kernel.org
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memory_hotplug: do not associate hotadded memory to zones until online · f1dd2cd1
      Michal Hocko authored
      The current memory hotplug implementation relies on having all the
      struct pages associated with a zone/node during the physical hotplug
      phase (arch_add_memory->__add_pages->__add_section->__add_zone).  In the
      vast majority of cases this means that they are added to ZONE_NORMAL.
      This has been so since 9d99aaa3 ("[PATCH] x86_64: Support memory
      hotadd without sparsemem") and it wasn't a big deal back then because
      movable onlining didn't exist yet.
      
      Much later memory hotplug wanted to (ab)use ZONE_MOVABLE for movable
      onlining 511c2aba ("mm, memory-hotplug: dynamic configure movable
      memory and portion memory") and then things got more complicated.
      Rather than reconsidering the zone association which was no longer
      needed (because the memory hotplug already depended on SPARSEMEM) a
      convoluted semantic of zone shifting has been developed.  Only the
      currently last memblock or the one adjacent to the zone_movable can be
      onlined movable.  This essentially means that the online type changes as
      the new memblocks are added.
      
      Let's simulate memory hot online manually
        $ echo 0x100000000 > /sys/devices/system/memory/probe
        $ grep . /sys/devices/system/memory/memory32/valid_zones
        Normal Movable
      
        $ echo $((0x100000000+(128<<20))) > /sys/devices/system/memory/probe
        $ grep . /sys/devices/system/memory/memory3?/valid_zones
        /sys/devices/system/memory/memory32/valid_zones:Normal
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
      
        $ echo $((0x100000000+2*(128<<20))) > /sys/devices/system/memory/probe
        $ grep . /sys/devices/system/memory/memory3?/valid_zones
        /sys/devices/system/memory/memory32/valid_zones:Normal
        /sys/devices/system/memory/memory33/valid_zones:Normal
        /sys/devices/system/memory/memory34/valid_zones:Normal Movable
      
        $ echo online_movable > /sys/devices/system/memory/memory34/state
        $ grep . /sys/devices/system/memory/memory3?/valid_zones
        /sys/devices/system/memory/memory32/valid_zones:Normal
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
        /sys/devices/system/memory/memory34/valid_zones:Movable Normal
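
      The memoryNN numbers in the session above follow from the probed
      physical address.  A small sketch of that arithmetic, assuming the
      128 MiB memory block size that the probe commands step by (block_nr is a
      hypothetical helper name):

```shell
#!/bin/sh
# Assumed memory block size: 128 MiB, the step used by the probe commands.
BLOCK_BYTES=$((128 << 20))

# block_nr ADDRESS
# Prints the memory block number that contains the physical address.
block_nr() {
    echo $(( $1 / BLOCK_BYTES ))
}

block_nr 0x100000000                          # prints 32
block_nr $((0x100000000 + (128 << 20)))       # prints 33
block_nr $((0x100000000 + 2 * (128 << 20)))   # prints 34
```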
      
      This is an awkward semantic because a udev event is sent as soon as the
      block is onlined and a udev handler might want to online it based on
      some policy (e.g.  association with a node) but it will inherently race
      with new blocks showing up.
      
      This patch changes the physical online phase to not associate pages with
      any zone at all.  All the pages are just marked reserved and wait for
      the onlining phase to be associated with the zone as per the online
      request.  There are only two requirements
      
      	- existing ZONE_NORMAL and ZONE_MOVABLE cannot overlap
      
      	- ZONE_NORMAL precedes ZONE_MOVABLE in physical addresses
      
      The latter is not an inherent requirement and can be changed in the
      future.  It preserves the current behavior and makes the code slightly
      simpler.
      
      This means that the same physical online steps as above will lead to the
      following state:

        Normal Movable
      
        /sys/devices/system/memory/memory32/valid_zones:Normal Movable
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
      
        /sys/devices/system/memory/memory32/valid_zones:Normal Movable
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
        /sys/devices/system/memory/memory34/valid_zones:Normal Movable
      
        /sys/devices/system/memory/memory32/valid_zones:Normal Movable
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
        /sys/devices/system/memory/memory34/valid_zones:Movable
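
      The two requirements (no overlap, Normal before Movable) can be modelled
      in a few lines of shell.  This is only a sketch of the rule as described
      here, not the kernel's allow_online_pfn_range(): it works in memory block
      numbers instead of pfns, and valid_zones, NORMAL_END and MOVABLE_START
      are hypothetical names.

```shell
#!/bin/sh
# valid_zones BLOCK NORMAL_END MOVABLE_START
#   BLOCK          memory block number to query
#   NORMAL_END     first block past the end of the normal zone (0 if empty)
#   MOVABLE_START  first block of the movable zone, or -1 if it is empty
# Prints which zones the block could be onlined to under this toy model.
valid_zones() {
    block=$1
    normal_end=$2
    movable_start=$3
    zones=""

    # Normal is possible if the block does not reach into the movable zone.
    if [ "$movable_start" -lt 0 ] || [ $((block + 1)) -le "$movable_start" ]; then
        zones="Normal"
    fi

    # Movable is possible if the block starts at or after the end of Normal.
    if [ "$block" -ge "$normal_end" ]; then
        zones="${zones:+$zones }Movable"
    fi

    echo "${zones:-none}"
}

valid_zones 34 0 -1   # nothing onlined yet: prints "Normal Movable"
valid_zones 33 0 34   # memory34 onlined movable: prints "Normal Movable"
valid_zones 35 0 34   # above the movable start: prints "Movable"
```

      Because the answer depends only on where the two zones currently end and
      start, it no longer changes merely because another block was probed.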
      
      Implementation:
      The current move_pfn_range is reimplemented to check the above
      requirements (allow_online_pfn_range) and then updates the respective
      zone (move_pfn_range_to_zone), the pgdat and links all the pages in the
      pfn range with the zone/node.  __add_pages is updated to not require the
      zone and only initializes sections in the range.  This allowed
      simplifying the arch_add_memory code (s390 could get rid of quite some
      code).
      
      devm_memremap_pages is the only user of arch_add_memory which relies on
      the zone association because it hooks into the memory hotplug only
      half way.  It uses it to associate the new memory with ZONE_DEVICE but
      doesn't allow it to be {on,off}lined via sysfs.  This means that this
      particular code path has to call move_pfn_range_to_zone explicitly.
      
      The original zone shifting code is kept in place and will be removed in
      the follow up patch for an easier review.
      
      Please note that this patch also changes the original behavior:
      offlining a memory block adjacent to another zone (Normal vs.  Movable)
      used to allow changing its movable type.  This will be handled later.
      
      [richard.weiyang@gmail.com: simplify zone_intersects()]
        Link: http://lkml.kernel.org/r/20170616092335.5177-1-richard.weiyang@gmail.com
      [richard.weiyang@gmail.com: remove duplicate call for set_page_links]
        Link: http://lkml.kernel.org/r/20170616092335.5177-2-richard.weiyang@gmail.com
      [akpm@linux-foundation.org: remove unused local `i']
      Link: http://lkml.kernel.org/r/20170515085827.16474-12-mhocko@kernel.org
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Tested-by: Dan Williams <dan.j.williams@intel.com>
      Tested-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # For s390 bits
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f1dd2cd1
    • Michal Hocko's avatar
      mm: consider zone which is not fully populated to have holes · 2d070eab
      Michal Hocko authored
      __pageblock_pfn_to_page has two users currently, set_zone_contiguous
      which checks whether the given zone contains holes and
      pageblock_pfn_to_page which then carefully returns a first valid page
      from the given pfn range for the given zone.  This doesn't handle zones
      which are not fully populated though.  Memory pageblocks can be offlined
      or might not have been onlined yet.  In such a case the zone should be
      considered to have holes otherwise pfn walkers can touch and play with
      offline pages.
      
      Current callers of pageblock_pfn_to_page in compaction seem to work
      properly right now because they only isolate PageBuddy
      (isolate_freepages_block) or PageLRU resp.  __PageMovable
      (isolate_migratepages_block) which will be always false for these pages.
      It would be safer to skip these pages altogether, though.
      
      In order to do this, the patch adds a new memory section state
      (SECTION_IS_ONLINE) which is set in memory_present (during boot time) or
      in online_pages_range during the memory hotplug.  Similarly
      offline_mem_sections clears the bit and it is called when the memory
      range is offlined.
      
      A pfn_to_online_page helper is then added which checks the mem section
      and only returns a page if it is onlined already.
      
      Use the new helper in __pageblock_pfn_to_page and skip the whole page
      block in such a case.
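
      The idea can be illustrated with a toy model in shell.  It is not the
      kernel implementation: the online state is kept as a plain list of
      section numbers instead of a SECTION_IS_ONLINE bit, all function names
      are hypothetical, and PAGES_PER_SECTION assumes 128 MiB sections with
      4 KiB pages.

```shell
#!/bin/sh
# Assumed geometry: 128 MiB sections, 4 KiB pages.
PAGES_PER_SECTION=32768
# Space-separated list of section numbers that are currently online.
ONLINE_SECTIONS=""

# Mark a section online (stands in for setting the online bit).
online_section() {
    ONLINE_SECTIONS="$ONLINE_SECTIONS $1"
}

# Mark a section offline again (stands in for clearing the online bit).
offline_section() {
    kept=""
    for s in $ONLINE_SECTIONS; do
        [ "$s" -eq "$1" ] || kept="$kept $s"
    done
    ONLINE_SECTIONS=$kept
}

# pfn_is_online PFN
# Succeeds only if the section that contains PFN is marked online, so a
# pfn walker can skip the range otherwise.
pfn_is_online() {
    nr=$(( $1 / PAGES_PER_SECTION ))
    for s in $ONLINE_SECTIONS; do
        [ "$s" -eq "$nr" ] && return 0
    done
    return 1
}

online_section 32
if pfn_is_online $((32 * PAGES_PER_SECTION + 5)); then
    echo "section 32: usable"
fi
if ! pfn_is_online $((33 * PAGES_PER_SECTION)); then
    echo "section 33: skip, not online"
fi
```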
      
      [mhocko@suse.com: check valid section number in pfn_to_online_page (Vlastimil),
       mark sections online after all struct pages are initialized in
       online_pages_range (Vlastimil)]
        Link: http://lkml.kernel.org/r/20170518164210.GD18333@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20170515085827.16474-8-mhocko@kernel.org
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2d070eab
    • Michal Hocko's avatar
      mm, memory_hotplug: split up register_one_node() · 9037a993
      Michal Hocko authored
      Memory hotplug (add_memory_resource) has to reinitialize node
      infrastructure if the node is offline (one which went through the
      complete add_memory(); remove_memory() cycle).  That involves node
      registration to the kobj infrastructure (register_node), the proper
      association with cpus (register_cpu_under_node) and finally creation of
      node<->memblock symlinks (link_mem_sections).
      
      The last part requires knowing node_start_pfn and node_spanned_pages,
      which we currently have, but a later patch will postpone this
      initialization to the onlining phase which happens later.  In fact we do
      not need to rely on the early pgdat initialization even now because the
      hot added pfn range is already known.
      
      Split register_one_node into a core which does all the common work for the
      boot time NUMA initialization and the hotplug (__register_one_node).
      register_one_node keeps the full initialization while hotplug calls
      __register_one_node and manually calls link_mem_sections for the proper
      range.
      
      This shouldn't introduce any functional change.
      
      Link: http://lkml.kernel.org/r/20170515085827.16474-6-mhocko@kernel.org
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9037a993
    • Michal Hocko's avatar
      mm, memory_hotplug: get rid of is_zone_device_section · 1b862aec
      Michal Hocko authored
      Device memory hotplug hooks into regular memory hotplug only half way.
      It needs memory sections to track struct pages but there is no
      need/desire to associate those sections with memory blocks and export
      them to the userspace via sysfs because they cannot be onlined anyway.
      
      This is currently expressed by for_device argument to arch_add_memory
      which then makes sure to associate the given memory range with
      ZONE_DEVICE.  register_new_memory then relies on is_zone_device_section
      to distinguish special memory hotplug from the regular one.  While this
      works now, later patches in this series want to move __add_zone outside
      of arch_add_memory path so we have to come up with something else.
      
      Add want_memblock down the __add_pages path and use it to control
      whether the section->memblock association should be done.
      arch_add_memory then just trivially wants a memblock for everything but
      for_device hotplug.
      
      remove_memory_section doesn't need is_zone_device_section either.  We
      can simply skip all the memblock specific cleanup if there is no
      memblock for the given section.
      
      This shouldn't introduce any functional change.
      
      Link: http://lkml.kernel.org/r/20170515085827.16474-5-mhocko@kernel.org
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Tested-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1b862aec
    • Michal Hocko's avatar
      mm, memory_hotplug: use node instead of zone in can_online_high_movable · c8f95657
      Michal Hocko authored
      The primary purpose of this helper is to query the node state so use the
      node id directly.  This is a preparatory patch for later changes.
      
      This shouldn't introduce any functional change
      
      Link: http://lkml.kernel.org/r/20170515085827.16474-3-mhocko@kernel.org
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c8f95657
    • Michal Hocko's avatar
      mm: remove return value from init_currently_empty_zone · dc0bbf3b
      Michal Hocko authored
      Patch series "mm: make movable onlining suck less", v4.
      
      Movable onlining is a real hack with many downsides - mainly
      reintroduction of lowmem/highmem issues we used to have on 32b systems -
      but it is the only way to make the memory hotremove more reliable which
      is something that people are asking for.
      
      The current semantic of memory movable onlining is really cumbersome,
      however.  The main reason for this is that the udev driven approach is
      basically unusable because udev races with the memory probing while only
      the last memory block or the one adjacent to the existing zone_movable
      are allowed to be onlined movable.  In short the criterion for the
      successful online_movable changes under udev's feet.  A reliable udev
      approach would require a 2 phase approach where the first successful
      movable online would have to check all the previous blocks and online
      them in descending order.  This is hard to be considered sane.
      
      This patchset aims at making the onlining semantic more usable.  First
      of all it allows to online memory movable as long as it doesn't clash
      with the existing ZONE_NORMAL.  That means that ZONE_NORMAL and
      ZONE_MOVABLE cannot overlap.  Currently I preserve the original ordering
      semantic so the normal zone always precedes the movable zone but I have plans
      to remove this restriction in future because it is not really necessary.
      
      First 3 patches are cleanups which should be ready to be merged right
      away (unless I have missed something subtle of course).
      
      Patch 4 deals with ZONE_DEVICE dependencies down the __add_pages path.
      
      Patch 5 deals with implicit assumptions of register_one_node on pgdat
      initialization.
      
      Patches 6-10 deal with offline holes in the zone for pfn walkers.  I
      hope I got all of them right but people familiar with compaction should
      double check this.
      
      Patch 11 is the core of the change.  In order to make it easier to
      review I have tried to keep it as minimal as possible, and the large
      code removal is moved to patch 14.
      
      Patch 12 is a trivial follow up cleanup.  Patch 13 fixes sparse warnings
      and finally patch 14 removes the unused code.
      
      I have tested the patches in kvm:
        # qemu-system-x86_64 -enable-kvm -monitor pty -m 2G,slots=4,maxmem=4G -numa node,mem=1G -numa node,mem=1G ...
      
      and then probed the additional memory by
        (qemu) object_add memory-backend-ram,id=mem1,size=1G
        (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
      
      Then I have used this simple script to probe the memory block by hand
        # cat probe_memblock.sh
        #!/bin/sh
      
        BLOCK_NR=$1
      
        echo $((0x100000000+$BLOCK_NR*(128<<20))) > /sys/devices/system/memory/probe
      
        # for i in $(seq 10); do sh probe_memblock.sh $i; done
        # grep . /sys/devices/system/memory/memory3?/valid_zones 2>/dev/null
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
        /sys/devices/system/memory/memory34/valid_zones:Normal Movable
        /sys/devices/system/memory/memory35/valid_zones:Normal Movable
        /sys/devices/system/memory/memory36/valid_zones:Normal Movable
        /sys/devices/system/memory/memory37/valid_zones:Normal Movable
        /sys/devices/system/memory/memory38/valid_zones:Normal Movable
        /sys/devices/system/memory/memory39/valid_zones:Normal Movable
      
      The main difference to the original implementation is that all new
      memblocks can be both online_kernel and online_movable initially because
      there is no clash obviously.  For the comparison the original
      implementation would have
      
        /sys/devices/system/memory/memory33/valid_zones:Normal
        /sys/devices/system/memory/memory34/valid_zones:Normal
        /sys/devices/system/memory/memory35/valid_zones:Normal
        /sys/devices/system/memory/memory36/valid_zones:Normal
        /sys/devices/system/memory/memory37/valid_zones:Normal
        /sys/devices/system/memory/memory38/valid_zones:Normal
        /sys/devices/system/memory/memory39/valid_zones:Normal Movable
      
      Now
        # echo online_movable > /sys/devices/system/memory/memory34/state
        # grep . /sys/devices/system/memory/memory3?/valid_zones 2>/dev/null
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
        /sys/devices/system/memory/memory34/valid_zones:Movable
        /sys/devices/system/memory/memory35/valid_zones:Movable
        /sys/devices/system/memory/memory36/valid_zones:Movable
        /sys/devices/system/memory/memory37/valid_zones:Movable
        /sys/devices/system/memory/memory38/valid_zones:Movable
        /sys/devices/system/memory/memory39/valid_zones:Movable
      
      Block 33 can still be onlined both as kernel and as movable, while all
      the remaining blocks can only be onlined movable.
      
      /proc/zoneinfo says
        Node 0, zone   Normal
          pages free     0
                min      0
                low      0
                high     0
                spanned  0
                present  0
        --
        Node 0, zone  Movable
          pages free     32753
                min      85
                low      117
                high     149
                spanned  32768
                present  32768
      
      Probing a new range at a lower address will result in a new memblock (32)
      which will still allow both Normal and Movable.
      
        # sh probe_memblock.sh 0
        # grep . /sys/devices/system/memory/memory3[2-5]/valid_zones 2>/dev/null
        /sys/devices/system/memory/memory32/valid_zones:Normal Movable
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
        /sys/devices/system/memory/memory34/valid_zones:Movable
        /sys/devices/system/memory/memory35/valid_zones:Movable
      
      and online_kernel will convert it to the zone normal properly
      while 33 can be still onlined both ways.
      
        # echo online_kernel > /sys/devices/system/memory/memory32/state
        # grep . /sys/devices/system/memory/memory3[2-5]/valid_zones 2>/dev/null
        /sys/devices/system/memory/memory32/valid_zones:Normal
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
        /sys/devices/system/memory/memory34/valid_zones:Movable
        /sys/devices/system/memory/memory35/valid_zones:Movable
      
      /proc/zoneinfo will now tell
        Node 0, zone   Normal
          pages free     65441
                min      165
                low      230
                high     295
                spanned  65536
                present  65536
        --
        Node 0, zone  Movable
          pages free     32740
                min      82
                low      114
                high     146
                spanned  32768
                present  32768
      
      so both zones have one memblock spanned and present.
      
      Onlining 39 should associate this block to the movable zone
      
        # echo online > /sys/devices/system/memory/memory39/state
      
      /proc/zoneinfo will now tell
        Node 0, zone   Normal
          pages free     32765
                min      80
                low      112
                high     144
                spanned  32768
                present  32768
        --
        Node 0, zone  Movable
          pages free     65501
                min      160
                low      225
                high     290
                spanned  196608
                present  65536
      
      so we will have a movable zone which spans 6 memblocks, 2 present and 4
      representing a hole.
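
      The spanned and present numbers for the movable zone can be checked with
      a little arithmetic.  The sketch below assumes 128 MiB memory blocks and
      4 KiB pages (the x86_64 guest used in the test); the helper names are
      hypothetical.

```shell
#!/bin/sh
# Assumed geometry: 128 MiB memory blocks, 4 KiB pages.
BLOCK_BYTES=$((128 << 20))
PAGE_BYTES=4096
PAGES_PER_BLOCK=$((BLOCK_BYTES / PAGE_BYTES))   # 32768

# spanned_pages FIRST_BLOCK LAST_BLOCK
# Pages covered by a zone stretching from FIRST_BLOCK to LAST_BLOCK inclusive.
spanned_pages() {
    echo $(( ($2 - $1 + 1) * PAGES_PER_BLOCK ))
}

# present_pages NR_ONLINE_BLOCKS
# Pages actually backed by onlined blocks.
present_pages() {
    echo $(( $1 * PAGES_PER_BLOCK ))
}

spanned_pages 34 39   # blocks 34..39: prints 196608
present_pages 2       # blocks 34 and 39 online: prints 65536
```

      The difference, 4 blocks or 131072 pages, is the hole between memory34
      and memory39.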
      
      Offlining both movable blocks will lead to the zone with no present
      pages which is the expected behavior I believe.
      
        # echo offline > /sys/devices/system/memory/memory39/state
        # echo offline > /sys/devices/system/memory/memory34/state
        # grep -A6 "Movable\|Normal" /proc/zoneinfo
        Node 0, zone   Normal
          pages free     32735
                min      90
                low      122
                high     154
                spanned  32768
                present  32768
        --
        Node 0, zone  Movable
          pages free     0
                min      0
                low      0
                high     0
                spanned  196608
                present  0
      
      As a bonus we will get a nice cleanup in the memory hotplug codebase.
      
      This patch (of 16):
      
      init_currently_empty_zone doesn't have any error to return yet it is
      still an int and callers try to be defensive and try to handle potential
      error.  Remove this nonsense and simplify all callers.
      
      This patch shouldn't have any visible effect
      
      Link: http://lkml.kernel.org/r/20170515085827.16474-2-mhocko@kernel.org
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dc0bbf3b
  3. 03 May, 2017 1 commit
    • Mel Gorman's avatar
      mm, vmscan: prevent kswapd sleeping prematurely due to mismatched classzone_idx · e716f2eb
      Mel Gorman authored
      kswapd is woken to reclaim a node based on a failed allocation request
      from any eligible zone.  Once reclaiming in balance_pgdat(), it will
      continue reclaiming until there is an eligible zone available for the
      zone it was woken for.  kswapd tracks what zone it was recently woken
      for in pgdat->kswapd_classzone_idx.  If it has not been woken recently,
      this zone will be 0.
      
      However, the decision on whether to sleep is made on
      kswapd_classzone_idx which is 0 without a recent wakeup request and that
      classzone does not account for lowmem reserves.  This allows kswapd to
      sleep when a small low zone such as ZONE_DMA is balanced for a GFP_DMA
      request even if a stream of allocations cannot use that zone.  While
      kswapd may be woken again shortly, there are two
      consequences -- the pgdat bits that control congestion are cleared
      prematurely and direct reclaim is more likely as kswapd slept
      prematurely.
      
      This patch flips kswapd_classzone_idx to default to MAX_NR_ZONES (an
      invalid index) when there have been no recent wakeups.  If there are no
      wakeups, it'll decide whether to sleep based on the highest possible
      zone available (MAX_NR_ZONES - 1).  It then becomes critical that the
      "pgdat balanced" decisions during reclaim and when deciding to sleep are
      the same.  If there is a mismatch, kswapd can stay awake continually
      trying to balance tiny zones.
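
      The sleep decision described above can be sketched as follows.  This is
      an illustration of the rule as stated in this message, not the vmscan
      code: MAX_NR_ZONES=4 is an assumed value (the real one depends on the
      kernel configuration) and effective_classzone is a hypothetical name.

```shell
#!/bin/sh
# Assumed for illustration; the real value depends on the kernel config.
MAX_NR_ZONES=4

# effective_classzone STORED_IDX
# STORED_IDX models pgdat->kswapd_classzone_idx.  MAX_NR_ZONES is the
# "no recent wakeup" marker, in which case the highest possible zone is used.
effective_classzone() {
    if [ "$1" -eq "$MAX_NR_ZONES" ]; then
        echo $((MAX_NR_ZONES - 1))
    else
        echo "$1"
    fi
}

effective_classzone 4   # no recent wakeup: prints 3
effective_classzone 1   # woken for zone index 1: prints 1
```

      With the old default of 0, the first call would have picked the lowest
      zone and allowed kswapd to sleep as soon as that zone was balanced.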
      
      simoop was used to evaluate it again.  Two of the preparation patches
      regressed the workload so they are included as the second set of
      results.  Otherwise this patch looks artificially excellent.
      
                                               4.11.0-rc1            4.11.0-rc1            4.11.0-rc1
                                                  vanilla              clear-v2          keepawake-v2
      Amean    p50-Read             21670074.18 (  0.00%) 19786774.76 (  8.69%) 22668332.52 ( -4.61%)
      Amean    p95-Read             25456267.64 (  0.00%) 24101956.27 (  5.32%) 26738688.00 ( -5.04%)
      Amean    p99-Read             29369064.73 (  0.00%) 27691872.71 (  5.71%) 30991404.52 ( -5.52%)
      Amean    p50-Write                1390.30 (  0.00%)     1011.91 ( 27.22%)      924.91 ( 33.47%)
      Amean    p95-Write              412901.57 (  0.00%)    34874.98 ( 91.55%)     1362.62 ( 99.67%)
      Amean    p99-Write             6668722.09 (  0.00%)   575449.60 ( 91.37%)    16854.04 ( 99.75%)
      Amean    p50-Allocation          78714.31 (  0.00%)    84246.26 ( -7.03%)    74729.74 (  5.06%)
      Amean    p95-Allocation         175533.51 (  0.00%)   400058.43 (-127.91%)   101609.74 ( 42.11%)
      Amean    p99-Allocation         247003.02 (  0.00%) 10905600.00 (-4315.17%)   125765.57 ( 49.08%)
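
      The figures in parentheses appear to be the relative change against the
      vanilla column, with positive values meaning lower latency.  A sketch
      that recomputes them under that assumption (improvement_pct is a
      hypothetical helper; awk is used because the shell has no floating
      point):

```shell
#!/bin/sh
# improvement_pct BASELINE VALUE
# Prints (BASELINE - VALUE) / BASELINE as a percentage with two decimals.
# Positive means VALUE is lower (better latency) than BASELINE.
improvement_pct() {
    awk -v b="$1" -v v="$2" 'BEGIN { printf "%.2f\n", (b - v) / b * 100 }'
}

improvement_pct 1390.30 1011.91       # p50-Write, clear-v2: prints 27.22
improvement_pct 175533.51 400058.43   # p95-Allocation, clear-v2: prints -127.91
```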
      
      With this patch on top, write and allocation latencies are massively
      improved.  The read latencies are slightly impaired but it's worth
      noting that this is mostly due to the IO scheduler and not directly
      related to reclaim.  The vmstats are a bit of a mix but the relevant
      ones are as follows:
      
                                  4.10.0-rc7  4.10.0-rc7  4.10.0-rc7
                                mmots-20170209  clear-v1r25  keepawake-v1r25
      Swap Ins                             0           0           0
      Swap Outs                            0         608           0
      Direct pages scanned           6910672     3132699     6357298
      Kswapd pages scanned          57036946    82488665    56986286
      Kswapd pages reclaimed        55993488    63474329    55939113
      Direct pages reclaimed         6905990     2964843     6352115
      Kswapd efficiency                  98%         76%         98%
      Kswapd velocity              12494.375   17597.507   12488.065
      Direct efficiency                  99%         94%         99%
      Direct velocity               1513.835     668.306    1393.148
      Page writes by reclaim           0.000 4410243.000       0.000
      Page writes file                     0     4409635           0
      Page writes anon                     0         608           0
      Page reclaim immediate         1036792    14175203     1042571
      
                                  4.11.0-rc1  4.11.0-rc1  4.11.0-rc1
                                     vanilla  clear-v2  keepawake-v2
      Swap Ins                             0          12           0
      Swap Outs                            0         838           0
      Direct pages scanned           6579706     3237270     6256811
      Kswapd pages scanned          61853702    79961486    54837791
      Kswapd pages reclaimed        60768764    60755788    53849586
      Direct pages reclaimed         6579055     2987453     6256151
      Kswapd efficiency                  98%         75%         98%
      Page writes by reclaim           0.000 4389496.000       0.000
      Page writes file                     0     4388658           0
      Page writes anon                     0         838           0
      Page reclaim immediate         1073573    14473009      982507
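
      The efficiency rows can be reproduced from the scanned and reclaimed
      counters.  The sketch below assumes that efficiency is reclaimed pages
      as a percentage of scanned pages, truncated to an integer; that formula
      is inferred from the figures in the tables, and the helper name
      reclaim_efficiency() is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: reclaim efficiency as a truncated percentage.
 * Assumption: efficiency = reclaimed * 100 / scanned, using integer division.
 */
static unsigned int reclaim_efficiency(uint64_t reclaimed, uint64_t scanned)
{
    assert(scanned != 0);   /* this sketch does not handle an idle reclaimer */
    return (unsigned int)(reclaimed * 100 / scanned);
}
```

      With the clear-v1r25 kswapd counters, reclaim_efficiency(63474329,
      82488665) truncates 76.9% down to the 76% shown in the first table.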
      
      Swap-outs are equivalent to baseline.
      
      Direct reclaim is reduced but not eliminated.  It's worth noting that
      there are two periods of direct reclaim for this workload.  The first is
      when it switches from preparing the files to the actual test itself.
      It's a lot of file IO followed by a lot of allocs that reclaims heavily
      for a brief window.  While direct reclaim is lower with clear-v2, it is
      due to kswapd scanning aggressively and trying to reclaim the world
      which is not the right thing to do.  With the patches applied, there is
      still direct reclaim but it occurs at the phase change from "creating
      work files" to starting multiple threads that allocate a lot of
      anonymous memory faster than kswapd can reclaim.
      
      Scanning/reclaim efficiency is restored by this patch.
      
      Page writes from reclaim context are back at 0 which is ideal.
      
      The number of pages immediately reclaimed after IO completes is slightly
      improved, but this is expected to vary slightly.
      
      On UMA, there is almost no change so this is not expected to be a
      universal win.
      
      [mgorman@suse.de: fix ->kswapd_classzone_idx initialization]
        Link: http://lkml.kernel.org/r/20170406174538.5msrznj6nt6qpbx5@suse.de
      Link: http://lkml.kernel.org/r/20170309075657.25121-4-mgorman@techsingularity.net
      
      
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Shantanu Goel <sgoel01@yahoo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 16 Mar, 2017 1 commit
    • mm: add private lock to serialize memory hotplug operations · 55adc1d0
      Heiko Carstens authored
      Commit bfc8c901 ("mem-hotplug: implement get/put_online_mems")
      introduced new functions get/put_online_mems() and mem_hotplug_begin/done()
      in order to give memory hotplug semantics similar to those of cpu
      hotplug.
      
      The corresponding functions for cpu hotplug are get/put_online_cpus()
      and cpu_hotplug_begin/done().
      
      However, the commit did not introduce functions that would serialize
      memory hotplug operations the way cpu_maps_update_begin/done() does
      for cpu hotplug.
      
      This basically leaves mem_hotplug.active_writer unprotected and allows
      concurrent writers to modify it, which may lead to problems as outlined
      by commit f931ab47 ("mm: fix devm_memremap_pages crash, use
      mem_hotplug_{begin, done}").
      
      That commit was extended again with commit b5d24fda ("mm,
      devm_memremap_pages: hold device_hotplug lock over mem_hotplug_{begin,
      done}") which serializes memory hotplug operations for some call sites
      by using the device_hotplug lock.
      
      In addition, commit 3fc21924 ("mm: validate device_hotplug is held
      for memory hotplug") added a sanity check to mem_hotplug_begin() to
      verify that the device_hotplug lock is held.
      
      This in turn triggers the following warning on s390:
      
      WARNING: CPU: 6 PID: 1 at drivers/base/core.c:643 assert_held_device_hotplug+0x4a/0x58
       Call Trace:
        assert_held_device_hotplug+0x40/0x58
        mem_hotplug_begin+0x34/0xc8
        add_memory_resource+0x7e/0x1f8
        add_memory+0xda/0x130
        add_memory_merged+0x15c/0x178
        sclp_detect_standby_memory+0x2ae/0x2f8
        do_one_initcall+0xa2/0x150
        kernel_init_freeable+0x228/0x2d8
        kernel_init+0x2a/0x140
        kernel_thread_starter+0x6/0xc
      
      One possible fix would be to add more lock_device_hotplug() and
      unlock_device_hotplug() calls around each call site of
      mem_hotplug_begin/done().  But that would give the device_hotplug lock
      additional semantics it should not have (serializing memory hotplug
      operations).
      
      Instead, add a new memory_add_remove_lock with semantics similar to
      those of cpu_add_remove_lock for cpu hotplug.
      
      To keep things simple, the lock is taken and released within the
      mem_hotplug_begin/done() functions.
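
      The locking scheme can be pictured with a small userspace model.  This
      is only a sketch under stated assumptions: a pthread mutex stands in
      for the kernel mutex, a plain flag stands in for
      mem_hotplug.active_writer, and the reader refcounting done by the real
      functions is left out entirely.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Userspace stand-in for the kernel mutex; the name follows the commit. */
static pthread_mutex_t memory_add_remove_lock = PTHREAD_MUTEX_INITIALIZER;

/* Simplified stand-in for mem_hotplug.active_writer (assumption: a flag). */
static bool active_writer;

static void mem_hotplug_begin(void)
{
    /* Serialize writers first, so only one of them can set active_writer. */
    pthread_mutex_lock(&memory_add_remove_lock);
    assert(!active_writer);
    active_writer = true;
}

static void mem_hotplug_done(void)
{
    /* Clear the writer before letting the next hotplug operation in. */
    active_writer = false;
    pthread_mutex_unlock(&memory_add_remove_lock);
}
```

      Because the mutex is taken inside the begin/done pair, call sites need
      no extra locking of their own to be serialized against each other.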
      
      Link: http://lkml.kernel.org/r/20170314125226.16779-2-heiko.carstens@de.ibm.com
      
      
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Reported-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Acked-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 02 Mar, 2017 1 commit
  6. 25 Feb, 2017 5 commits
  7. 23 Feb, 2017 1 commit
    • mm/memory_hotplug: set magic number to page->freelist instead of page->lru.next · ddffe98d
      Yasuaki Ishimatsu authored
      To identify page-table pages that were allocated from the bootmem
      allocator, a magic number is stored in page->lru.next.
      
      But the page->lru list is initialized in reserve_bootmem_region().  So
      when free_pagetable() is called, it cannot find the magic number in the
      pages, and it frees them with free_reserved_page() instead of
      put_page_bootmem().
      
      But pages that are allocated from the bootmem allocator and used as page
      tables have the private flag set.  So before freeing the pages, we
      should clear the private flag with put_page_bootmem().
      
      Before applying the commit 7bfec6f4 ("mm, page_alloc: check multiple
      page fields with a single branch"), we could find the following visible
      issue:
      
        BUG: Bad page state in process kworker/u1024:1
        page:ffffea103cfd8040 count:0 mapcount:0 mappi
        flags: 0x6fffff80000800(private)
        page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
        bad because of flags: 0x800(private)
        <snip>
        Call Trace:
        [...] dump_stack+0x63/0x87
        [...] bad_page+0x114/0x130
        [...] free_pages_prepare+0x299/0x2d0
        [...] free_hot_cold_page+0x31/0x150
        [...] __free_pages+0x25/0x30
        [...] free_pagetable+0x6f/0xb4
        [...] remove_pagetable+0x379/0x7ff
        [...] vmemmap_free+0x10/0x20
        [...] sparse_remove_one_section+0x149/0x180
        [...] __remove_pages+0x2e9/0x4f0
        [...] arch_remove_memory+0x63/0xc0
        [...] remove_memory+0x8c/0xc0
        [...] acpi_memory_device_remove+0x79/0xa5
        [...] acpi_bus_trim+0x5a/0x8d
        [...] acpi_bus_trim+0x38/0x8d
        [...] acpi_device_hotplug+0x1b7/0x418
        [...] acpi_hotplug_work_fn+0x1e/0x29
        [...] process_one_work+0x152/0x400
        [...] worker_thread+0x125/0x4b0
        [...] kthread+0xd8/0xf0
        [...] ret_from_fork+0x22/0x40
      
      And the issue still silently occurs.
      
      Until the page-table pages allocated from the bootmem allocator are
      freed, page->freelist is never used.  So the patch stores the magic
      number in page->freelist instead of page->lru.next.
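
      Why the field matters can be shown with a reduced userspace model.
      Everything here is illustrative: struct fake_page is a hypothetical
      two-field stand-in for struct page, FAKE_BOOTMEM_MAGIC is not the
      kernel's actual magic value, and init_list_head() only models the list
      initialization that reserve_bootmem_region() is described as doing.

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

/* Hypothetical, heavily reduced stand-in for struct page. */
struct fake_page {
    struct list_head lru;
    void *freelist;
};

/* Illustrative value only; not the kernel's actual magic number. */
#define FAKE_BOOTMEM_MAGIC ((void *)0x1234UL)

/* Models the list initialization: both pointers refer back to the head. */
static void init_list_head(struct list_head *list)
{
    list->next = list;
    list->prev = list;
}

/* Returns 1 if a magic stored in lru.next survives list initialization. */
static int magic_in_lru_survives(void)
{
    struct fake_page page = { { NULL, NULL }, NULL };

    page.lru.next = FAKE_BOOTMEM_MAGIC;
    init_list_head(&page.lru);      /* overwrites lru.next */
    return page.lru.next == FAKE_BOOTMEM_MAGIC;
}

/* Returns 1 if a magic stored in freelist survives list initialization. */
static int magic_in_freelist_survives(void)
{
    struct fake_page page = { { NULL, NULL }, NULL };

    page.freelist = FAKE_BOOTMEM_MAGIC;
    init_list_head(&page.lru);      /* does not touch freelist */
    return page.freelist == FAKE_BOOTMEM_MAGIC;
}

/* Sanity check combining both cases. */
static void check_magic_placement(void)
{
    assert(!magic_in_lru_survives());
    assert(magic_in_freelist_survives());
}
```

      In this model the value kept in lru.next is lost as soon as the list
      head is initialized, while the value kept in freelist is still there
      when the page is later examined.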
      
      [isimatu.yasuaki@jp.fujitsu.com: fix merge issue]
        Link: http://lkml.kernel.org/r/722b1cc4-93ac-dd8b-2be2-7a7e313b3b0b@gmail.com
      Link: http://lkml.kernel.org/r/2c29bd9f-5b67-02d0-18a3-8828e78bbb6f@gmail.com
      
      
      Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 03 Feb, 2017 2 commits
    • base/memory, hotplug: fix a kernel oops in show_valid_zones() · a96dfddb
      Toshi Kani authored
      Reading a sysfs "memoryN/valid_zones" file leads to the following oops
      when the first page of a range is not backed by struct page.
      show_valid_zones() assumes that 'start_pfn' is always valid for
      page_zone().
      
       BUG: unable to handle kernel paging request at ffffea017a000000
       IP: show_valid_zones+0x6f/0x160
      
      This issue may happen on x86-64 systems with 64GiB or more memory since
      their memory block size is bumped up to 2GiB.  [1] An example of such
      systems is described below.  0x3240000000 is only aligned to 1GiB and
      this memory block starts from 0x3200000000, which is not backed by
      struct page.
      
       BIOS-e820: [mem 0x0000003240000000-0x000000603fffffff] usable
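
      The alignment arithmetic behind this example can be checked with a
      short sketch.  It assumes the 2GiB memory block size stated above; the
      helper block_start() and the GIB macro are hypothetical names
      introduced for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define GIB (1ULL << 30)

/* Round an address down to the start of its memory block. */
static uint64_t block_start(uint64_t addr, uint64_t block_size)
{
    /* The mask trick below only works for power-of-two block sizes. */
    assert(block_size != 0 && (block_size & (block_size - 1)) == 0);
    return addr & ~(block_size - 1);
}
```

      For the e820 range above, block_start(0x3240000000, 2 * GIB) is
      0x3200000000, which lies 1GiB below the first usable address and so
      has no struct page behind it.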
      
      Since test_pages_in_a_zone() already checks holes, fix this issue by
      extending this function to return 'valid_start' and 'valid_end' for a
      given range.  show_valid_zones() then proceeds with the valid range.
      
      [1] 'Commit bdee237c ("x86: mm: Use 2GB memory block size on
          large-memory x86-64 systems")'
      
      Link: http://lkml.kernel.org/r/20170127222149.30893-3-toshi.kani@hpe.com
      
      
      Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Zhang Zhen <zhenzhang.zhang@huawei.com>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>	[4.4+]
      
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug.c: check start_pfn in test_pages_in_a_zone() · deb88a2a
      Toshi Kani authored
      Patch series "fix a kernel oops when reading sysfs valid_zones", v2.
      
      A sysfs memory file is created for each 2GiB memory block on x86-64 when
      the system has 64GiB or more memory.  [1] When the start address of a
      memory block is not backed by struct page, i.e.  a memory range is not
      aligned by 2GiB, reading its 'valid_zones' attribute file leads to a
      kernel oops.  This issue was observed on multiple x86-64 systems with
      more than 64GiB of memory.  This patch-set fixes this issue.
      
      Patch 1 first fixes an issue in test_pages_in_a_zone(), which does not
      test the start section.
      
      Patch 2 then fixes the kernel oops by extending test_pages_in_a_zone()
      to return valid [start, end).
      
      Note for stable kernels: The memory block size change was made by commit
      bdee237c ("x86: mm: Use 2GB memory block size on large-memory x86-64
      systems"), which was accepted to 3.9.  However, this patch-set depends
      on (and fixes) the change to test_pages_in_a_zone() made by commit
      5f0f2887 ("mm/memory_hotplug.c: check for missing sections in
      test_pages_in_a_zone()"), which was accepted to 4.4.
      
      So, I recommend that we backport it up to 4.4.
      
      [1] 'Commit bdee237c ("x86: mm: Use 2GB memory block size on
          large-memory x86-64 systems")'
      
      This patch (of 2):
      
      test_pages_in_a_zone() does not check 'start_pfn' when it is aligned by
      section since 'sec_end_pfn' is set equal to 'pfn'.  Since this function
      is called for testing the range of a sysfs memory file, 'start_pfn' is
      always aligned by section.
      
      Fix it by properly setting 'sec_end_pfn' to the next section pfn.
      
      Also make sure that this function returns 1 only when the range belongs
      to a zone.
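
      The off-by-one can be illustrated with a userspace sketch.  The
      section size below is an assumed value, and the helper names are
      hypothetical; rounding up from start_pfn + 1 is shown as one way to
      obtain "the next section pfn", not as a quotation of the patch.

```c
#include <assert.h>

/* Assumed value for illustration; the real value is arch-dependent. */
#define PAGES_PER_SECTION 32768UL

static_assert((PAGES_PER_SECTION & (PAGES_PER_SECTION - 1)) == 0,
              "section size must be a power of two");

/* Round a pfn up to the next section boundary (no-op if already aligned). */
static unsigned long section_align_up(unsigned long pfn)
{
    return (pfn + PAGES_PER_SECTION - 1) & ~(PAGES_PER_SECTION - 1);
}

/* Behaviour described above: for an aligned start, end equals start. */
static unsigned long sec_end_pfn_old(unsigned long start_pfn)
{
    return section_align_up(start_pfn);
}

/* Rounding up from start_pfn + 1 always yields the next section pfn. */
static unsigned long sec_end_pfn_fixed(unsigned long start_pfn)
{
    return section_align_up(start_pfn + 1);
}
```

      With a section-aligned start_pfn, the old computation yields an empty
      first range [start_pfn, start_pfn), so the start section is never
      examined; the fixed computation covers one full section.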
      
      Link: http://lkml.kernel.org/r/20170127222149.30893-2-toshi.kani@hpe.com
      
      
      Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
      Cc: Andrew Banman <abanman@sgi.com>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: <stable@vger.kernel.org>	[4.4+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 25 Jan, 2017 1 commit
    • memory_hotplug: make zone_can_shift() return a boolean value · 8a1f780e
      Yasuaki Ishimatsu authored
      online_{kernel|movable} is used to change the memory zone to
      ZONE_{NORMAL|MOVABLE} and online the memory.
      
      To check whether the memory zone can be changed, zone_can_shift() is
      used.  Currently the function returns a negative integer, a positive
      integer, or 0.  A negative or positive return value means that the
      memory zone can be changed to ZONE_{NORMAL|MOVABLE}.
      
      But a return value of 0 has two meanings.
      
      One meaning is that the memory zone does not need to be changed.
      For example, when memory is in ZONE_NORMAL and onlined by online_kernel,
      the memory zone does not need to be changed.
      
      The other meaning is that the memory zone cannot be changed.  When
      memory is in ZONE_NORMAL and onlined by online_movable, the memory zone
      may not be changed to ZONE_MOVABLE due to a memory online limitation
      (see Documentation/memory-hotplug.txt).  In this case, memory must not
      be onlined.
      
      The patch changes the return type of zone_can_shift() so that memory
      online operation fails when memory zone cannot be changed as follows:
      
      Before applying patch:
         # grep -A 35 "Node 2" /proc/zoneinfo
         Node 2, zone   Normal
         <snip>
            node_scanned  0
                 spanned  8388608
                 present  7864320
                 managed  7864320
         # echo online_movable > memory4097/state
         # grep -A 35 "Node 2" /proc/zoneinfo
         Node 2, zone   Normal
         <snip>
            node_scanned  0
                 spanned  8388608
                 present  8388608
                 managed  8388608
      
         online_movable operation succeeded. But memory is onlined as
         ZONE_NORMAL, not ZONE_MOVABLE.
      
      After applying patch:
         # grep -A 35 "Node 2" /proc/zoneinfo
         Node 2, zone   Normal
         <snip>
            node_scanned  0
                 spanned  8388608
                 present  7864320
                 managed  7864320
         # echo online_movable > memory4097/state
         bash: echo: write error: Invalid argument
         # grep -A 35 "Node 2" /proc/zoneinfo
         Node 2, zone   Normal
         <snip>
            node_scanned  0
                 spanned  8388608
                 present  7864320
                 managed  7864320
      
         The online_movable operation failed because the memory zone could
         not be changed from ZONE_NORMAL to ZONE_MOVABLE.
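
      The idea of separating "can the zone be shifted" from "by how much"
      can be sketched as follows.  This is a simplified userspace model, not
      the kernel function: zone_can_shift_model(), online_memory_model(),
      the shift_allowed parameter and the two-entry zone enum are all
      hypothetical.  The -EINVAL result mirrors the "Invalid argument" error
      shown above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum zone_type { ZONE_NORMAL, ZONE_MOVABLE };

/*
 * Simplified model: the boolean result says whether onlining may proceed,
 * and the shift amount is reported separately through zone_shift.
 */
static bool zone_can_shift_model(enum zone_type cur, enum zone_type target,
                                 bool shift_allowed, int *zone_shift)
{
    assert(zone_shift != NULL);
    *zone_shift = 0;

    if (cur == target)
        return true;            /* no change needed; still a success */
    if (!shift_allowed)
        return false;           /* change impossible; caller must fail */

    *zone_shift = (int)target - (int)cur;
    return true;
}

/* Caller sketch: 0 on success, -EINVAL when the zone cannot be changed. */
static int online_memory_model(enum zone_type cur, enum zone_type target,
                               bool shift_allowed)
{
    int zone_shift;

    if (!zone_can_shift_model(cur, target, shift_allowed, &zone_shift))
        return -EINVAL;
    return 0;
}
```

      With a single integer return value, the first two branches would both
      report 0 and the caller could not tell them apart; the boolean makes
      the failure case explicit.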
      
      Fixes: df429ac0 ("memory-hotplug: more general validation of zone during online")
      Link: http://lkml.kernel.org/r/2f9c3837-33d7-b6e5-59c0-6ca4372b2d84@gmail.com
      
      
      Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Reviewed-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 13 Dec, 2016 1 commit
    • mm: remove x86-only restriction of movable_node · 39fa104d
      Reza Arbab authored
      In commit c5320926 ("mem-hotplug: introduce movable_node boot
      option"), the memblock allocation direction is changed to bottom-up and
      then back to top-down like this:
      
      1. memblock_set_bottom_up(true), called by cmdline_parse_movable_node().
      2. memblock_set_bottom_up(false), called by x86's numa_init().
      
      Even though (1) occurs in generic mm code, it is wrapped by #ifdef
      CONFIG_MOVABLE_NODE, which depends on X86_64.
      
      This means that when we extend CONFIG_MOVABLE_NODE to non-x86 arches,
      things will be unbalanced.  (1) will happen for them, but (2) will not.
      
      This toggle was added in the first place because x86 has a delay between
      adding memblocks and marking them as hotpluggable.  Since other arches
      do this marking either immediately or not at all, they do not require
      the bottom-up toggle.
      
      So, resolve things by moving (1) from cmdline_parse_movable_node() to
      x86's setup_arch(), immediately after the movable_node parameter has
      been parsed.
      
      Link: http://lkml.kernel.org/r/1479160961-25840-3-git-send-email-arbab@linux.vnet.ibm.com
      
      
      Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Alistair Popple <apopple@au1.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
      Cc: Frank Rowand <frowand.list@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Stewart Smith <stewart@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 27 Oct, 2016 2 commits
    • mm: remove unused variable in memory hotplug · 9db4f36e
      Linus Torvalds authored
      When I removed the per-zone bitlock hashed waitqueues in commit
      9dcb8b68 ("mm: remove per-zone hashtable of bitlock waitqueues"), I
      removed all the magic hotplug memory initialization of said waitqueues
      too.
      
      But when I actually _tested_ the resulting build, I stupidly assumed
      that "allmodconfig" would enable memory hotplug.  And it doesn't,
      because it enables KASAN instead, which then disables hotplug memory
      support.
      
      As a result, my build test of the per-zone waitqueues was totally
      broken, and I didn't notice that the compiler warns about the now unused
      iterator variable 'i'.
      
      I guess I should be happy that that seems to be the worst breakage from
      my clearly horribly failed test coverage.
      
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove per-zone hashtable of bitlock waitqueues · 9dcb8b68
      Linus Torvalds authored
      
      
      The per-zone waitqueues exist because of a scalability issue with the
      page waitqueues on some NUMA machines, but it turns out that they hurt
      normal loads, and now with the vmalloced stacks they also end up
      breaking gfs2 that uses a bit_wait on a stack object:
      
           wait_on_bit(&gh->gh_iflags, HIF_WAIT, TASK_UNINTERRUPTIBLE)
      
      where 'gh' can be a reference to the local variable 'mount_gh' on the
      stack of fill_super().
      
      The reason the per-zone hash table breaks for this case is that there is
      no "zone" for virtual allocations, and trying to look up the physical
      page to get at it will fail (with a BUG_ON()).
      
      It turns out that I actually complained to the mm people about the
      per-zone hash table for another reason just a month ago: the zone lookup
      also hurts the regular use of "unlock_page()" a lot, because the zone
      lookup ends up forcing several unnecessary cache misses and generates
      horrible code.
      
      As part of that earlier discussion, we had a much better solution for
      the NUMA scalability issue - by just making the page lock have a
      separate contention bit, the waitqueue doesn't even have to be looked at
      for the normal case.
      
      Peter Zijlstra already has a patch for that, but let's see if anybody
      even notices.  In the meantime, let's fix the actual gfs2 breakage by
      simplifying the bitlock waitqueues and removing the per-zone issue.
      
      Reported-by: Andreas Gruenbacher <agruenba@redhat.com>
      Tested-by: Bob Peterson <rpeterso@redhat.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 08 Oct, 2016 1 commit
  13. 28 Sep, 2016 1 commit
  14. 19 Sep, 2016 1 commit
  15. 11 Aug, 2016 1 commit
    • mm/memory_hotplug.c: initialize per_cpu_nodestats for hotadded pgdats · 5830169f
      Reza Arbab authored
      The following oops occurs after a pgdat is hotadded:
      
        Unable to handle kernel paging request for data at address 0x00c30001
        Faulting instruction address: 0xc00000000022f8f4
        Oops: Kernel access of bad area, sig: 11 [#1]
        SMP NR_CPUS=2048 NUMA pSeries
        Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter nls_utf8 isofs sg virtio_balloon uio_pdrv_genirq uio ip_tables xfs libcrc32c sr_mod cdrom sd_mod virtio_net ibmvscsi scsi_transport_srp virtio_pci virtio_ring virtio dm_mirror dm_region_hash dm_log dm_mod
        CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W 4.8.0-rc1-device #110
        task: c000000000ef3080 task.stack: c000000000f6c000
        NIP: c00000000022f8f4 LR: c00000000022f948 CTR: 0000000000000000
        REGS: c000000000f6fa50 TRAP: 0300   Tainted: G        W (4.8.0-rc1-device)
        MSR: 800000010280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]>  CR: 84002028  XER: 20000000
        CFAR: d000000001d2013c DAR: 0000000000c30001 DSISR: 40000000 SOFTE: 0
        NIP refresh_cpu_vm_stats+0x1a4/0x2f0
        LR refresh_cpu_vm_stats+0x1f8/0x2f0
        Call Trace:
          refresh_cpu_vm_stats+0x1f8/0x2f0 (unreliable)
      
      Add per_cpu_nodestats initialization to the hotplug codepath.
      
      Link: http://lkml.kernel.org/r/1470931473-7090-1-git-send-email-arbab@linux.vnet.ibm.com
      
      
      Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>