1. 06 Jul, 2017 20 commits
    • Michal Hocko's avatar
      mm, memory_hotplug: remove unused cruft after memory hotplug rework · 559bfc7d
      Michal Hocko authored
      zone_for_memory doesn't have any user anymore as well as the whole zone
      shifting infrastructure so drop them all.
      
      This shouldn't introduce any functional changes.
      
      Link: http://lkml.kernel.org/r/20170515085827.16474-15-mhocko@kernel.org
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      559bfc7d
    • Michal Hocko's avatar
      mm, memory_hotplug: replace for_device by want_memblock in arch_add_memory · 3d79a728
      Michal Hocko authored
      arch_add_memory gets for_device argument which then controls whether we
      want to create memblocks for created memory sections.  Simplify the
      logic by telling whether we want memblocks directly rather than going
      through pointless negation.  This also makes the api easier to
      understand because it is clear what we want rather than nothing telling
      for_device which can mean anything.
      
      This shouldn't introduce any functional change.
      
      Link: http://lkml.kernel.org/r/20170515085827.16474-13-mhocko@kernel.org
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Tested-by: default avatarDan Williams <dan.j.williams@intel.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3d79a728
    • Michal Hocko's avatar
      mm, memory_hotplug: do not assume ZONE_NORMAL is default kernel zone · c246a213
      Michal Hocko authored
      Heiko Carstens has noticed that he can generate overlapping zones for
      ZONE_DMA and ZONE_NORMAL:
      
        DMA      [mem 0x0000000000000000-0x000000007fffffff]
        Normal   [mem 0x0000000080000000-0x000000017fffffff]
      
        $ cat /sys/devices/system/memory/block_size_bytes
        10000000
        $ cat /sys/devices/system/memory/memory5/valid_zones
        DMA
        $ echo 0 > /sys/devices/system/memory/memory5/online
        $ cat /sys/devices/system/memory/memory5/valid_zones
        Normal
        $ echo 1 > /sys/devices/system/memory/memory5/online
        Normal
      
        $ cat /proc/zoneinfo
        Node 0, zone      DMA
        spanned  524288        <-----
        present  458752
        managed  455078
        start_pfn:           0 <-----
      
        Node 0, zone   Normal
        spanned  720896
        present  589824
        managed  571648
        start_pfn:           327680 <-----
      
      The reason is that we assume that the default zone for kernel onlining
      is ZONE_NORMAL.  This was a simplification introduced by the memory
      hotplug rework and it is easily fixable by checking the range overlap in
      the zone order and considering the first matching zone as the default
      one.  If there is no such zone then assume ZONE_NORMAL as we have been
      doing so far.
      
      Fixes: "mm, memory_hotplug: do not associate hotadded memory to zones until online"
      Link: http://lkml.kernel.org/r/20170601083746.4924-3-mhocko@kernel.org
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reported-by: default avatarHeiko Carstens <heiko.carstens@de.ibm.com>
      Tested-by: default avatarHeiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c246a213
    • Michal Hocko's avatar
      mm, memory_hotplug: do not associate hotadded memory to zones until online · f1dd2cd1
      Michal Hocko authored
      The current memory hotplug implementation relies on having all the
      struct pages associate with a zone/node during the physical hotplug
      phase (arch_add_memory->__add_pages->__add_section->__add_zone).  In the
      vast majority of cases this means that they are added to ZONE_NORMAL.
      This has been so since 9d99aaa3 ("[PATCH] x86_64: Support memory
      hotadd without sparsemem") and it wasn't a big deal back then because
      movable onlining didn't exist yet.
      
      Much later memory hotplug wanted to (ab)use ZONE_MOVABLE for movable
      onlining 511c2aba ("mm, memory-hotplug: dynamic configure movable
      memory and portion memory") and then things got more complicated.
      Rather than reconsidering the zone association which was no longer
      needed (because the memory hotplug already depended on SPARSEMEM) a
      convoluted semantic of zone shifting has been developed.  Only the
      currently last memblock or the one adjacent to the zone_movable can be
      onlined movable.  This essentially means that the online type changes as
      the new memblocks are added.
      
      Let's simulate memory hot online manually
        $ echo 0x100000000 > /sys/devices/system/memory/probe
        $ grep . /sys/devices/system/memory/memory32/valid_zones
        Normal Movable
      
        $ echo $((0x100000000+(128<<20))) > /sys/devices/system/memory/probe
        $ grep . /sys/devices/system/memory/memory3?/valid_zones
        /sys/devices/system/memory/memory32/valid_zones:Normal
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
      
        $ echo $((0x100000000+2*(128<<20))) > /sys/devices/system/memory/probe
        $ grep . /sys/devices/system/memory/memory3?/valid_zones
        /sys/devices/system/memory/memory32/valid_zones:Normal
        /sys/devices/system/memory/memory33/valid_zones:Normal
        /sys/devices/system/memory/memory34/valid_zones:Normal Movable
      
        $ echo online_movable > /sys/devices/system/memory/memory34/state
        $ grep . /sys/devices/system/memory/memory3?/valid_zones
        /sys/devices/system/memory/memory32/valid_zones:Normal
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
        /sys/devices/system/memory/memory34/valid_zones:Movable Normal
      
      This is an awkward semantic because an udev event is sent as soon as the
      block is onlined and an udev handler might want to online it based on
      some policy (e.g.  association with a node) but it will inherently race
      with new blocks showing up.
      
      This patch changes the physical online phase to not associate pages with
      any zone at all.  All the pages are just marked reserved and wait for
      the onlining phase to be associated with the zone as per the online
      request.  There are only two requirements
      
      	- existing ZONE_NORMAL and ZONE_MOVABLE cannot overlap
      
      	- ZONE_NORMAL precedes ZONE_MOVABLE in physical addresses
      
      the latter one is not an inherent requirement and can be changed in the
      future.  It preserves the current behavior and made the code slightly
      simpler.  This is subject to change in future.
      
      This means that the same physical online steps as above will lead to the
      following state: Normal Movable
      
        /sys/devices/system/memory/memory32/valid_zones:Normal Movable
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
      
        /sys/devices/system/memory/memory32/valid_zones:Normal Movable
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
        /sys/devices/system/memory/memory34/valid_zones:Normal Movable
      
        /sys/devices/system/memory/memory32/valid_zones:Normal Movable
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
        /sys/devices/system/memory/memory34/valid_zones:Movable
      
      Implementation:
      The current move_pfn_range is reimplemented to check the above
      requirements (allow_online_pfn_range) and then updates the respective
      zone (move_pfn_range_to_zone), the pgdat and links all the pages in the
      pfn range with the zone/node.  __add_pages is updated to not require the
      zone and only initializes sections in the range.  This allowed to
      simplify the arch_add_memory code (s390 could get rid of quite some of
      code).
      
      devm_memremap_pages is the only user of arch_add_memory which relies on
      the zone association because it only hooks into the memory hotplug only
      half way.  It uses it to associate the new memory with ZONE_DEVICE but
      doesn't allow it to be {on,off}lined via sysfs.  This means that this
      particular code path has to call move_pfn_range_to_zone explicitly.
      
      The original zone shifting code is kept in place and will be removed in
      the follow up patch for an easier review.
      
      Please note that this patch also changes the original behavior when
      offlining a memory block adjacent to another zone (Normal vs.  Movable)
      used to allow to change its movable type.  This will be handled later.
      
      [richard.weiyang@gmail.com: simplify zone_intersects()]
        Link: http://lkml.kernel.org/r/20170616092335.5177-1-richard.weiyang@gmail.com
      [richard.weiyang@gmail.com: remove duplicate call for set_page_links]
        Link: http://lkml.kernel.org/r/20170616092335.5177-2-richard.weiyang@gmail.com
      [akpm@linux-foundation.org: remove unused local `i']
      Link: http://lkml.kernel.org/r/20170515085827.16474-12-mhocko@kernel.org
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarWei Yang <richard.weiyang@gmail.com>
      Tested-by: default avatarDan Williams <dan.j.williams@intel.com>
      Tested-by: default avatarReza Arbab <arbab@linux.vnet.ibm.com>
      Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # For s390 bits
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f1dd2cd1
    • Michal Hocko's avatar
      mm: consider zone which is not fully populated to have holes · 2d070eab
      Michal Hocko authored
      __pageblock_pfn_to_page has two users currently, set_zone_contiguous
      which checks whether the given zone contains holes and
      pageblock_pfn_to_page which then carefully returns a first valid page
      from the given pfn range for the given zone.  This doesn't handle zones
      which are not fully populated though.  Memory pageblocks can be offlined
      or might not have been onlined yet.  In such a case the zone should be
      considered to have holes otherwise pfn walkers can touch and play with
      offline pages.
      
      Current callers of pageblock_pfn_to_page in compaction seem to work
      properly right now because they only isolate PageBuddy
      (isolate_freepages_block) or PageLRU resp.  __PageMovable
      (isolate_migratepages_block) which will be always false for these pages.
      It would be safer to skip these pages altogether, though.
      
      In order to do this patch adds a new memory section state
      (SECTION_IS_ONLINE) which is set in memory_present (during boot time) or
      in online_pages_range during the memory hotplug.  Similarly
      offline_mem_sections clears the bit and it is called when the memory
      range is offlined.
      
      pfn_to_online_page helper is then added which check the mem section and
      only returns a page if it is onlined already.
      
      Use the new helper in __pageblock_pfn_to_page and skip the whole page
      block in such a case.
      
      [mhocko@suse.com: check valid section number in pfn_to_online_page (Vlastimil),
       mark sections online after all struct pages are initialized in
       online_pages_range (Vlastimil)]
        Link: http://lkml.kernel.org/r/20170518164210.GD18333@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20170515085827.16474-8-mhocko@kernel.org
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2d070eab
    • Michal Hocko's avatar
      mm, memory_hotplug: split up register_one_node() · 9037a993
      Michal Hocko authored
      Memory hotplug (add_memory_resource) has to reinitialize node
      infrastructure if the node is offline (one which went through the
      complete add_memory(); remove_memory() cycle).  That involves node
      registration to the kobj infrastructure (register_node), the proper
      association with cpus (register_cpu_under_node) and finally creation of
      node<->memblock symlinks (link_mem_sections).
      
      The last part requires to know node_start_pfn and node_spanned_pages
      which we currently have but a leter patch will postpone this
      initialization to the onlining phase which happens later.  In fact we do
      not need to rely on the early pgdat initialization even now because the
      currently hot added pfn range is currently known.
      
      Split register_one_node into core which does all the common work for the
      boot time NUMA initialization and the hotplug (__register_one_node).
      register_one_node keeps the full initialization while hotplug calls
      __register_one_node and manually calls link_mem_sections for the proper
      range.
      
      This shouldn't introduce any functional change.
      
      Link: http://lkml.kernel.org/r/20170515085827.16474-6-mhocko@kernel.org
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9037a993
    • Michal Hocko's avatar
      mm, memory_hotplug: get rid of is_zone_device_section · 1b862aec
      Michal Hocko authored
      Device memory hotplug hooks into regular memory hotplug only half way.
      It needs memory sections to track struct pages but there is no
      need/desire to associate those sections with memory blocks and export
      them to the userspace via sysfs because they cannot be onlined anyway.
      
      This is currently expressed by for_device argument to arch_add_memory
      which then makes sure to associate the given memory range with
      ZONE_DEVICE.  register_new_memory then relies on is_zone_device_section
      to distinguish special memory hotplug from the regular one.  While this
      works now, later patches in this series want to move __add_zone outside
      of arch_add_memory path so we have to come up with something else.
      
      Add want_memblock down the __add_pages path and use it to control
      whether the section->memblock association should be done.
      arch_add_memory then just trivially want memblock for everything but
      for_device hotplug.
      
      remove_memory_section doesn't need is_zone_device_section either.  We
      can simply skip all the memblock specific cleanup if there is no
      memblock for the given section.
      
      This shouldn't introduce any functional change.
      
      Link: http://lkml.kernel.org/r/20170515085827.16474-5-mhocko@kernel.org
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Tested-by: default avatarDan Williams <dan.j.williams@intel.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1b862aec
    • Michal Hocko's avatar
      mm: remove return value from init_currently_empty_zone · dc0bbf3b
      Michal Hocko authored
      Patch series "mm: make movable onlining suck less", v4.
      
      Movable onlining is a real hack with many downsides - mainly
      reintroduction of lowmem/highmem issues we used to have on 32b systems -
      but it is the only way to make the memory hotremove more reliable which
      is something that people are asking for.
      
      The current semantic of memory movable onlinening is really cumbersome,
      however.  The main reason for this is that the udev driven approach is
      basically unusable because udev races with the memory probing while only
      the last memory block or the one adjacent to the existing zone_movable
      are allowed to be onlined movable.  In short the criterion for the
      successful online_movable changes under udev's feet.  A reliable udev
      approach would require a 2 phase approach where the first successful
      movable online would have to check all the previous blocks and online
      them in descending order.  This is hard to be considered sane.
      
      This patchset aims at making the onlining semantic more usable.  First
      of all it allows to online memory movable as long as it doesn't clash
      with the existing ZONE_NORMAL.  That means that ZONE_NORMAL and
      ZONE_MOVABLE cannot overlap.  Currently I preserve the original ordering
      semantic so the zone always precedes the movable zone but I have plans
      to remove this restriction in future because it is not really necessary.
      
      First 3 patches are cleanups which should be ready to be merged right
      away (unless I have missed something subtle of course).
      
      Patch 4 deals with ZONE_DEVICE dependencies down the __add_pages path.
      
      Patch 5 deals with implicit assumptions of register_one_node on pgdat
      initialization.
      
      Patches 6-10 deal with offline holes in the zone for pfn walkers.  I
      hope I got all of them right but people familiar with compaction should
      double check this.
      
      Patch 11 is the core of the change.  In order to make it easier to
      review I have tried it to be as minimalistic as possible and the large
      code removal is moved to patch 14.
      
      Patch 12 is a trivial follow up cleanup.  Patch 13 fixes sparse warnings
      and finally patch 14 removes the unused code.
      
      I have tested the patches in kvm:
        # qemu-system-x86_64 -enable-kvm -monitor pty -m 2G,slots=4,maxmem=4G -numa node,mem=1G -numa node,mem=1G ...
      
      and then probed the additional memory by
        (qemu) object_add memory-backend-ram,id=mem1,size=1G
        (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
      
      Then I have used this simple script to probe the memory block by hand
        # cat probe_memblock.sh
        #!/bin/sh
      
        BLOCK_NR=$1
      
        # echo $((0x100000000+$BLOCK_NR*(128<<20))) > /sys/devices/system/memory/probe
      
        # for i in $(seq 10); do sh probe_memblock.sh $i; done
        # grep . /sys/devices/system/memory/memory3?/valid_zones 2>/dev/null
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
        /sys/devices/system/memory/memory34/valid_zones:Normal Movable
        /sys/devices/system/memory/memory35/valid_zones:Normal Movable
        /sys/devices/system/memory/memory36/valid_zones:Normal Movable
        /sys/devices/system/memory/memory37/valid_zones:Normal Movable
        /sys/devices/system/memory/memory38/valid_zones:Normal Movable
        /sys/devices/system/memory/memory39/valid_zones:Normal Movable
      
      The main difference to the original implementation is that all new
      memblocks can be both online_kernel and online_movable initially because
      there is no clash obviously.  For the comparison the original
      implementation would have
      
        /sys/devices/system/memory/memory33/valid_zones:Normal
        /sys/devices/system/memory/memory34/valid_zones:Normal
        /sys/devices/system/memory/memory35/valid_zones:Normal
        /sys/devices/system/memory/memory36/valid_zones:Normal
        /sys/devices/system/memory/memory37/valid_zones:Normal
        /sys/devices/system/memory/memory38/valid_zones:Normal
        /sys/devices/system/memory/memory39/valid_zones:Normal Movable
      
      Now
        # echo online_movable > /sys/devices/system/memory/memory34/state
        # grep . /sys/devices/system/memory/memory3?/valid_zones 2>/dev/null
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
        /sys/devices/system/memory/memory34/valid_zones:Movable
        /sys/devices/system/memory/memory35/valid_zones:Movable
        /sys/devices/system/memory/memory36/valid_zones:Movable
        /sys/devices/system/memory/memory37/valid_zones:Movable
        /sys/devices/system/memory/memory38/valid_zones:Movable
        /sys/devices/system/memory/memory39/valid_zones:Movable
      
      Block 33 can still be online both kernel and movable while all
      the remaining can be only movable.
      
      /proc/zonelist says
        Node 0, zone   Normal
          pages free     0
                min      0
                low      0
                high     0
                spanned  0
                present  0
        --
        Node 0, zone  Movable
          pages free     32753
                min      85
                low      117
                high     149
                spanned  32768
                present  32768
      
      A new memblock at a lower address will result in a new memblock (32)
      which will still allow both Normal and Movable.
      
        # sh probe_memblock.sh 0
        # grep . /sys/devices/system/memory/memory3[2-5]/valid_zones 2>/dev/null
        /sys/devices/system/memory/memory32/valid_zones:Normal Movable
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
        /sys/devices/system/memory/memory34/valid_zones:Movable
        /sys/devices/system/memory/memory35/valid_zones:Movable
      
      and online_kernel will convert it to the zone normal properly
      while 33 can be still onlined both ways.
      
        # echo online_kernel > /sys/devices/system/memory/memory32/state
        # grep . /sys/devices/system/memory/memory3[2-5]/valid_zones 2>/dev/null
        /sys/devices/system/memory/memory32/valid_zones:Normal
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
        /sys/devices/system/memory/memory34/valid_zones:Movable
        /sys/devices/system/memory/memory35/valid_zones:Movable
      
      /proc/zoneinfo will now tell
        Node 0, zone   Normal
          pages free     65441
                min      165
                low      230
                high     295
                spanned  65536
                present  65536
        --
        Node 0, zone  Movable
          pages free     32740
                min      82
                low      114
                high     146
                spanned  32768
                present  32768
      
      so both zones have one memblock spanned and present.
      
      Onlining 39 should associate this block to the movable zone
      
        # echo online > /sys/devices/system/memory/memory39/state
      
      /proc/zoneinfo will now tell
        Node 0, zone   Normal
          pages free     32765
                min      80
                low      112
                high     144
                spanned  32768
                present  32768
        --
        Node 0, zone  Movable
          pages free     65501
                min      160
                low      225
                high     290
                spanned  196608
                present  65536
      
      so we will have a movable zone which spans 6 memblocks, 2 present and 4
      representing a hole.
      
      Offlining both movable blocks will lead to the zone with no present
      pages which is the expected behavior I believe.
      
        # echo offline > /sys/devices/system/memory/memory39/state
        # echo offline > /sys/devices/system/memory/memory34/state
        # grep -A6 "Movable\|Normal" /proc/zoneinfo
        Node 0, zone   Normal
          pages free     32735
                min      90
                low      122
                high     154
                spanned  32768
                present  32768
        --
        Node 0, zone  Movable
          pages free     0
                min      0
                low      0
                high     0
                spanned  196608
                present  0
      
      As a bonus we will get a nice cleanup in the memory hotplug codebase.
      
      This patch (of 16):
      
      init_currently_empty_zone doesn't have any error to return yet it is
      still an int and callers try to be defensive and try to handle potential
      error.  Remove this nonsense and simplify all callers.
      
      This patch shouldn't have any visible effect
      
      Link: http://lkml.kernel.org/r/20170515085827.16474-2-mhocko@kernel.org
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Acked-by: default avatarBalbir Singh <bsingharora@gmail.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dc0bbf3b
    • Huang Ying's avatar
      mm, THP, swap: check whether THP can be split firstly · b8f593cd
      Huang Ying authored
      To swap out THP (Transparent Huage Page), before splitting the THP, the
      swap cluster will be allocated and the THP will be added into the swap
      cache.  But it is possible that the THP cannot be split, so that we must
      delete the THP from the swap cache and free the swap cluster.  To avoid
      that, in this patch, whether the THP can be split is checked firstly.
      The check can only be done racy, but it is good enough for most cases.
      
      With the patch, the swap out throughput improves 3.6% (from about
      4.16GB/s to about 4.31GB/s) in the vm-scalability swap-w-seq test case
      with 8 processes.  The test is done on a Xeon E5 v3 system.  The swap
      device used is a RAM simulated PMEM (persistent memory) device.  To test
      the sequential swapping out, the test case creates 8 processes, which
      sequentially allocate and write to the anonymous pages until the RAM and
      part of the swap device is used up.
      
      Link: http://lkml.kernel.org/r/20170515112522.32457-5-ying.huang@intel.com
      
      Signed-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> [for can_split_huge_page()]
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b8f593cd
    • Minchan Kim's avatar
      mm, THP, swap: move anonymous THP split logic to vmscan · 0f074658
      Minchan Kim authored
      The add_to_swap aims to allocate swap_space(ie, swap slot and swapcache)
      so if it fails due to lack of space in case of THP or something(hdd swap
      but tries THP swapout) *caller* rather than add_to_swap itself should
      split the THP page and retry it with base page which is more natural.
      
      Link: http://lkml.kernel.org/r/20170515112522.32457-4-ying.huang@intel.com
      
      Signed-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Signed-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0f074658
    • Minchan Kim's avatar
      mm, THP, swap: unify swap slot free functions to put_swap_page · 75f6d6d2
      Minchan Kim authored
      Now, get_swap_page takes struct page and allocates swap space according
      to page size(ie, normal or THP) so it would be more cleaner to introduce
      put_swap_page which is a counter function of get_swap_page.  Then, it
      calls right swap slot free function depending on page's size.
      
      [ying.huang@intel.com: minor cleanup and fix]
      Link: http://lkml.kernel.org/r/20170515112522.32457-3-ying.huang@intel.com
      
      Signed-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Signed-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      75f6d6d2
    • Huang Ying's avatar
      mm, THP, swap: delay splitting THP during swap out · 38d8b4e6
      Huang Ying authored
      Patch series "THP swap: Delay splitting THP during swapping out", v11.
      
      This patchset is to optimize the performance of Transparent Huge Page
      (THP) swap.
      
      Recently, the performance of the storage devices improved so fast that
      we cannot saturate the disk bandwidth with single logical CPU when do
      page swap out even on a high-end server machine.  Because the
      performance of the storage device improved faster than that of single
      logical CPU.  And it seems that the trend will not change in the near
      future.  On the other hand, the THP becomes more and more popular
      because of increased memory size.  So it becomes necessary to optimize
      THP swap performance.
      
      The advantages of the THP swap support include:
      
       - Batch the swap operations for the THP to reduce lock
         acquiring/releasing, including allocating/freeing the swap space,
         adding/deleting to/from the swap cache, and writing/reading the swap
         space, etc. This will help improve the performance of the THP swap.
      
       - The THP swap space read/write will be 2M sequential IO. It is
         particularly helpful for the swap read, which are usually 4k random
         IO. This will improve the performance of the THP swap too.
      
       - It will help the memory fragmentation, especially when the THP is
         heavily used by the applications. The 2M continuous pages will be
         free up after THP swapping out.
      
       - It will improve the THP utilization on the system with the swap
         turned on. Because the speed for khugepaged to collapse the normal
         pages into the THP is quite slow. After the THP is split during the
         swapping out, it will take quite long time for the normal pages to
         collapse back into the THP after being swapped in. The high THP
         utilization helps the efficiency of the page based memory management
         too.
      
      There are some concerns regarding THP swap in, mainly because possible
      enlarged read/write IO size (for swap in/out) may put more overhead on
      the storage device.  To deal with that, the THP swap in should be turned
      on only when necessary.  For example, it can be selected via
      "always/never/madvise" logic, to be turned on globally, turned off
      globally, or turned on only for VMA with MADV_HUGEPAGE, etc.
      
      This patchset is the first step for the THP swap support.  The plan is
      to delay splitting THP step by step, finally avoid splitting THP during
      the THP swapping out and swap out/in the THP as a whole.
      
      As the first step, in this patchset, the splitting huge page is delayed
      from almost the first step of swapping out to after allocating the swap
      space for the THP and adding the THP into the swap cache.  This will
      reduce lock acquiring/releasing for the locks used for the swap cache
      management.
      
      With the patchset, the swap out throughput improves 15.5% (from about
      3.73GB/s to about 4.31GB/s) in the vm-scalability swap-w-seq test case
      with 8 processes.  The test is done on a Xeon E5 v3 system.  The swap
      device used is a RAM simulated PMEM (persistent memory) device.  To test
      the sequential swapping out, the test case creates 8 processes, which
      sequentially allocate and write to the anonymous pages until the RAM and
      part of the swap device is used up.
      
      This patch (of 5):
      
      In this patch, splitting huge page is delayed from almost the first step
      of swapping out to after allocating the swap space for the THP
      (Transparent Huge Page) and adding the THP into the swap cache.  This
      will batch the corresponding operation, thus improve THP swap out
      throughput.
      
      This is the first step for the THP swap optimization.  The plan is to
      delay splitting the THP step by step and avoid splitting the THP
      finally.
      
      In this patch, one swap cluster is used to hold the contents of each THP
      swapped out.  So, the size of the swap cluster is changed to that of the
      THP (Transparent Huge Page) on x86_64 architecture (512).  For other
      architectures which want such THP swap optimization,
      ARCH_USES_THP_SWAP_CLUSTER needs to be selected in the Kconfig file for
      the architecture.  In effect, this will enlarge swap cluster size by 2
      times on x86_64.  Which may make it harder to find a free cluster when
      the swap space becomes fragmented.  So that, this may reduce the
      continuous swap space allocation and sequential write in theory.  The
      performance test in 0day shows no regressions caused by this.
      
      In the future of THP swap optimization, some information of the swapped
      out THP (such as compound map count) will be recorded in the
      swap_cluster_info data structure.
      
      The mem cgroup swap accounting functions are enhanced to support charge
      or uncharge a swap cluster backing a THP as a whole.
      
      The swap cluster allocate/free functions are added to allocate/free a
      swap cluster for a THP.  A fair simple algorithm is used for swap
      cluster allocation, that is, only the first swap device in priority list
      will be tried to allocate the swap cluster.  The function will fail if
      the trying is not successful, and the caller will fallback to allocate a
      single swap slot instead.  This works good enough for normal cases.  If
      the difference of the number of the free swap clusters among multiple
      swap devices is significant, it is possible that some THPs are split
      earlier than necessary.  For example, this could be caused by big size
      difference among multiple swap devices.
      
      The swap cache functions is enhanced to support add/delete THP to/from
      the swap cache as a set of (HPAGE_PMD_NR) sub-pages.  This may be
      enhanced in the future with multi-order radix tree.  But because we will
      split the THP soon during swapping out, that optimization doesn't make
      much sense for this first step.
      
      The THP splitting functions are enhanced to support to split THP in swap
      cache during swapping out.  The page lock will be held during allocating
      the swap cluster, adding the THP into the swap cache and splitting the
      THP.  So in the code path other than swapping out, if the THP need to be
      split, the PageSwapCache(THP) will be always false.
      
      The swap cluster is only available for SSD, so the THP swap optimization
      in this patchset has no effect for HDD.
      
      [ying.huang@intel.com: fix two issues in THP optimize patch]
        Link: http://lkml.kernel.org/r/87k25ed8zo.fsf@yhuang-dev.intel.com
      [hannes@cmpxchg.org: extensive cleanups and simplifications, reduce code size]
      Link: http://lkml.kernel.org/r/20170515112522.32457-2-ying.huang@intel.com
      
      Signed-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Suggested-by: Andrew Morton <akpm@linux-foundation.org> [for config option]
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> [for changes in huge_memory.c and huge_mm.h]
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      38d8b4e6
    • Dave Hansen's avatar
      mm, sparsemem: break out of loops early · c4e1be9e
      Dave Hansen authored
      There are a number of times that we loop over NR_MEM_SECTIONS, looking
      for section_present() on each section.  But, when we have very large
      physical address spaces (large MAX_PHYSMEM_BITS), NR_MEM_SECTIONS
      becomes very large, making the loops quite long.
      
      With MAX_PHYSMEM_BITS=46 and a section size of 128MB, the current loops
      are 512k iterations, which we barely notice on modern hardware.  But,
      raising MAX_PHYSMEM_BITS higher (like we will see on systems that
      support 5-level paging) makes this 64x longer and we start to notice,
      especially on slower systems like simulators.  A 10-second delay for
      512k iterations is annoying.  But, a 640- second delay is crippling.
      
      This does not help if we have extremely sparse physical address spaces,
      but those are quite rare.  We expect that most of the "slow" systems
      where this matters will also be quite small and non-sparse.
      
      To fix this, we track the highest section we've ever encountered.  This
      lets us know when we will *never* see another section_present(), and
      lets us break out of the loops earlier.
      
      Doing the whole for_each_present_section_nr() macro is probably
      overkill, but it will ensure that any future loop iterations that we
      grow are more likely to be correct.
      
      Kirrill said "It shaved almost 40 seconds from boot time in qemu with
      5-level paging enabled for me".
      
      Link: http://lkml.kernel.org/r/20170504174434.C45A4735@viggo.jf.intel.com
      
      Signed-off-by: default avatarDave Hansen <dave.hansen@linux.intel.com>
      Tested-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c4e1be9e
    • Wei Yang's avatar
      mm/slub.c: wrap kmem_cache->cpu_partial in config CONFIG_SLUB_CPU_PARTIAL · e6d0e1dc
      Wei Yang authored
      kmem_cache->cpu_partial is just used when CONFIG_SLUB_CPU_PARTIAL is
      set, so wrap it with config CONFIG_SLUB_CPU_PARTIAL will save some space
      on 32bit arch.
      
      This patch wraps kmem_cache->cpu_partial in config CONFIG_SLUB_CPU_PARTIAL
      and wraps its sysfs too.
      
      Link: http://lkml.kernel.org/r/20170502144533.10729-4-richard.weiyang@gmail.com
      
      Signed-off-by: default avatarWei Yang <richard.weiyang@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e6d0e1dc
    • Wei Yang's avatar
      mm/slub.c: wrap cpu_slab->partial in CONFIG_SLUB_CPU_PARTIAL · a93cf07b
      Wei Yang authored
      cpu_slab's field partial is used when CONFIG_SLUB_CPU_PARTIAL is set,
      which means we can save a pointer's space on each cpu for every slub
      item.
      
      This patch wraps cpu_slab->partial in CONFIG_SLUB_CPU_PARTIAL and wraps
      its sysfs use too.
      
      [akpm@linux-foundation.org: avoid strange 80-col tricks]
      Link: http://lkml.kernel.org/r/20170502144533.10729-3-richard.weiyang@gmail.com
      
      Signed-off-by: default avatarWei Yang <richard.weiyang@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a93cf07b
    • Wei Yang's avatar
      mm/slub.c: pack red_left_pad with another int to save a word · d3111e6c
      Wei Yang authored
      Patch series "try to save some memory for kmem_cache in some cases", v2.
      
      kmem_cache is a frequently used data in kernel.  During the code
      reading, I found maybe we could save some space in some cases.
      
      1. On 64bit arch, type int will occupy a word if it doesn't sit well.
      
      2. cpu_slab->partial is just used when CONFIG_SLUB_CPU_PARTIAL is set
      
      3. cpu_partial is just used when CONFIG_SLUB_CPU_PARTIAL is set, while
         just save some space on 32bit arch.
      
      This patch (of 3):
      
      On 64bit arch, struct is 8-bytes aligned, so int will occupy a word if
      it doesn't sit well.
      
      This patch pack red_left_pad with reserved to save 8 bytes for struct
      kmem_cache on a 64bit arch.
      
      Link: http://lkml.kernel.org/r/20170502144533.10729-2-richard.weiyang@gmail.com
      
      Signed-off-by: default avatarWei Yang <richard.weiyang@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d3111e6c
    • Fabian Frederick's avatar
      ocfs2: use magic.h · 62aa81d7
      Fabian Frederick authored
      Filesystems generally use SUPER_MAGIC values from magic.h instead of a
      local definition.
      
      Link: http://lkml.kernel.org/r/20170521154217.27917-1-fabf@skynet.be
      
      Signed-off-by: default avatarFabian Frederick <fabf@skynet.be>
      Reviewed-by: default avatarMark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      62aa81d7
    • Michael Ellerman's avatar
      include/linux/filter.h: use linux/set_memory.h · 820a0b24
      Michael Ellerman authored
      This header always exists, so doesn't require an ifdef around its
      inclusion.  When CONFIG_ARCH_HAS_SET_MEMORY=y it includes the asm
      header, otherwise it provides empty versions of the set_memory_xx()
      routines.
      
      Link: http://lkml.kernel.org/r/1498717781-29151-4-git-send-email-mpe@ellerman.id.au
      
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Acked-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarKees Cook <keescook@chromium.org>
      Acked-by: default avatarLaura Abbott <labbott@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      820a0b24
    • Michael Ellerman's avatar
      provide linux/set_memory.h · 938f8464
      Michael Ellerman authored
      Currently code that wants to use set_memory_ro() etc, needs to include
      asm/set_memory.h, which doesn't exist on all arches.  Some code knows it
      only builds on arches which have the header, other code guards the
      inclusion with an #ifdef, neither is ideal.
      
      So create linux/set_memory.h.  This always exists, so users don't need
      an #ifdef just to include the header.
      
      When CONFIG_ARCH_HAS_SET_MEMORY=y it includes asm/set_memory.h,
      otherwise it provides empty non-failing implementations.
      
      Link: http://lkml.kernel.org/r/1498717781-29151-1-git-send-email-mpe@ellerman.id.au
      
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Acked-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarKees Cook <keescook@chromium.org>
      Acked-by: default avatarLaura Abbott <labbott@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      938f8464
    • David Rientjes's avatar
      compiler, clang: always inline when CONFIG_OPTIMIZE_INLINING is disabled · 9a04dbcf
      David Rientjes authored
      The motivation for commit abb2ea7d ("compiler, clang: suppress
      warning for unused static inline functions") was to suppress clang's
      warnings about unused static inline functions.
      
      For configs without CONFIG_OPTIMIZE_INLINING enabled, such as any non-x86
      architecture, `inline' in the kernel implies that
      __attribute__((always_inline)) is used.
      
      Some code depends on that behavior, see
        https://lkml.org/lkml/2017/6/13/918:
      
        net/built-in.o: In function `__xchg_mb':
        arch/arm64/include/asm/cmpxchg.h:99: undefined reference to `__compiletime_assert_99'
        arch/arm64/include/asm/cmpxchg.h:99: undefined reference to `__compiletime_assert_99
      
      The full fix would be to identify these breakages and annotate the
      functions with __always_inline instead of `inline'.  But since we are
      late in the 4.12-rc cycle, simply carry forward the forced inlining
      behavior and work toward moving arm64, and other architectures, toward
      CONFIG_OPTIMIZE_INLINING behavior.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1706261552200.1075@chino.kir.corp.google.com
      
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Reported-by: default avatarSodagudi Prasad <psodagud@codeaurora.org>
      Tested-by: default avatarSodagudi Prasad <psodagud@codeaurora.org>
      Tested-by: default avatarMatthias Kaehlcke <mka@chromium.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9a04dbcf
  2. 04 Jul, 2017 20 commits