1. 10 Jul, 2017 8 commits
    • mm, hugetlb: unclutter hugetlb allocation layers · aaf14e40
      Michal Hocko authored
      Patch series "mm, hugetlb: allow proper node fallback dequeue".
      While working on a hugetlb migration issue addressed in a separate
      patchset[1] I have noticed that the hugetlb allocations from the
      preallocated pool are quite suboptimal.
       [1] http://lkml.kernel.org/r/20170608074553.22152-1-mhocko@kernel.org
      There is no fallback mechanism implemented and no notion of preferred
      node.  I have tried to work around it but Vlastimil was right to push
      back for a more robust solution.  It seems that such a solution is to
      reuse the zonelist approach we use for the page allocator.
      This series has 3 patches.  The first one tries to make hugetlb
      allocation layers more clear.  The second one implements the zonelist
      hugetlb pool allocation and introduces a preferred node semantic which
      is used by the migration callbacks.  The last patch is a clean up.
      This patch (of 3):
      Hugetlb allocation path for fresh huge pages is unnecessarily complex
      and it mixes different interfaces between layers.
      __alloc_buddy_huge_page is the central place to perform a new
      allocation.  It checks for the hugetlb overcommit and then relies on
      __hugetlb_alloc_buddy_huge_page to invoke the page allocator.  This is
      all good except that __alloc_buddy_huge_page pushes vma and address down
      the callchain and so __hugetlb_alloc_buddy_huge_page has to deal with
      two different allocation modes - one for memory policy and other node
      specific (or to make it more obscure node non-specific) requests.
      This just screams for a reorganization.
      This patch pulls out all the vma specific handling up to
      __alloc_buddy_huge_page_with_mpol where it belongs.
      __alloc_buddy_huge_page will get nodemask argument and
      __hugetlb_alloc_buddy_huge_page will become a trivial wrapper over the
      page allocator.
      In short:
      __alloc_buddy_huge_page_with_mpol - memory policy handling
        __alloc_buddy_huge_page - overcommit handling and accounting
          __hugetlb_alloc_buddy_huge_page - page allocator layer
      Also note that the cpuset retry loop in __hugetlb_alloc_buddy_huge_page
      is not really needed because the page allocator already handles
      cpuset updates.
      Finally __hugetlb_alloc_buddy_huge_page had a special case for node
      specific allocations (when no policy is applied and there is a node
      given).  This has relied on __GFP_THISNODE to not fall back to a different
      node.  alloc_huge_page_node is the only caller which relies on this
      behavior so move the __GFP_THISNODE there.
      Not only does this remove quite some code, it should also make those
      layers easier to follow and clearer wrt responsibilities.
      Link: http://lkml.kernel.org/r/20170622193034.28972-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
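      The three layers described in this commit message can be modelled in
      plain userspace C.  This is only a sketch of the division of
      responsibilities, not the kernel code: every toy_* name, the bitmask
      used as a nodemask, and the per-node counters are assumptions made for
      illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NODES 4

/* Toy pool state; every name in this block is hypothetical. */
struct toy_hstate {
    int free_on_node[TOY_NODES]; /* memory available for fresh pages */
    int surplus;                 /* surplus pages handed out so far */
    int overcommit_limit;        /* maximum surplus pages allowed */
};

/* Toy memory policy: a preferred node plus a bitmask of allowed nodes. */
struct toy_policy {
    int preferred;
    unsigned int allowed;
};

/*
 * Bottom layer: stand-in for the page allocator.  Starts at the preferred
 * node and walks the other nodes unless this_node_only forbids fallback.
 */
static int toy_page_alloc(struct toy_hstate *h, int preferred,
                          unsigned int allowed, bool this_node_only)
{
    assert(preferred >= 0 && preferred < TOY_NODES);
    for (int i = 0; i < TOY_NODES; i++) {
        int nid = (preferred + i) % TOY_NODES;

        if (i > 0 && this_node_only)
            break;
        if (!(allowed & (1u << nid)))
            continue;
        if (h->free_on_node[nid] > 0) {
            h->free_on_node[nid]--;
            return nid;
        }
    }
    return -1;
}

/* Middle layer: overcommit check and accounting, nothing else. */
static int toy_alloc_surplus(struct toy_hstate *h, int preferred,
                             unsigned int allowed, bool this_node_only)
{
    int nid;

    if (h->surplus >= h->overcommit_limit)
        return -1;
    nid = toy_page_alloc(h, preferred, allowed, this_node_only);
    if (nid >= 0)
        h->surplus++;
    return nid;
}

/* Top layer: resolves the policy, then delegates to the layer below. */
static int toy_alloc_with_policy(struct toy_hstate *h,
                                 const struct toy_policy *pol)
{
    return toy_alloc_surplus(h, pol->preferred, pol->allowed, false);
}
```

      In this model only a caller that passes this_node_only = true gets the
      strict single-node behaviour, which mirrors the idea of moving
      __GFP_THISNODE into the one caller that relies on it.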
    • mm/hugetlb.c: replace memfmt with string_get_size · c6247f72
      Matthew Wilcox authored
      The hugetlb code has its own function to report human-readable sizes.
      Convert it to use the shared string_get_size() function.  This will lead
      to a minor difference in user visible output (MiB/GiB instead of MB/GB),
      but some would argue that's desirable anyway.
      Link: http://lkml.kernel.org/r/20170606190350.GA20010@bombadil.infradead.org
      Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: zhong jiang <zhongjiang@huawei.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
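      To illustrate the MiB/GiB wording mentioned above, here is a small
      userspace sketch that scales a byte count to binary (power-of-1024)
      units.  It is not string_get_size() and makes no claim about that
      function's exact output format; toy_scale_size and toy_format_size are
      hypothetical helpers that simply truncate to a whole number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Binary unit names: each step is a factor of 1024. */
static const char *const toy_units[] = { "B", "KiB", "MiB", "GiB", "TiB" };
#define TOY_NR_UNITS (sizeof(toy_units) / sizeof(toy_units[0]))

struct toy_size {
    unsigned long long value; /* scaled value, truncated */
    size_t unit;              /* index into toy_units */
};

/* Divide by 1024 until the value is below 1024 or units run out. */
static struct toy_size toy_scale_size(unsigned long long bytes)
{
    struct toy_size s = { bytes, 0 };

    while (s.value >= 1024 && s.unit + 1 < TOY_NR_UNITS) {
        s.value /= 1024;
        s.unit++;
    }
    return s;
}

/* Writes e.g. "16 MiB" into buf; returns what snprintf() returns. */
static int toy_format_size(unsigned long long bytes, char *buf, size_t len)
{
    struct toy_size s = toy_scale_size(bytes);

    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%llu %s", s.value, toy_units[s.unit]);
}
```

      With this sketch a 2097152-byte huge page is labelled in MiB and a
      1073741824-byte one in GiB, rather than with the ambiguous MB/GB.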
    • mm, hugetlb: schedule when potentially allocating many hugepages · 69ed779a
      David Rientjes authored
      A few hugetlb allocators loop while calling the page allocator and can
      potentially prevent rescheduling if the page allocator slowpath is not
      utilized.
      Conditionally schedule when large numbers of hugepages can be allocated.
       "Fixes a task which was getting hung while writing like 10000 hugepages
        (16MB on POWER8) into /proc/sys/vm/nr_hugepages."
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1706091535300.66176@chino.kir.corp.google.com
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Tested-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb, memory_hotplug: prefer to use reserved pages for migration · 4db9b2ef
      Michal Hocko authored
      new_node_page will try to use the origin's next NUMA node as the
      migration destination for hugetlb pages.  If such a node doesn't have
      any preallocated pool it falls back to __alloc_buddy_huge_page_no_mpol
      to allocate a surplus page instead.  This is quite suboptimal for any
      configuration where hugetlb pages are not distributed to all NUMA nodes
      evenly.  Say we have a hotpluggable node 4 and spare hugetlb pages are
      on node 0.
      Now we consume the whole pool on node 4 and try to offline this node.
      All the allocated pages should be moved to node 0 which has enough
      preallocated pages to hold them.  With the current implementation
      offlining very likely fails because hugetlb allocations during runtime
      are much less reliable.
      Fix this by reusing the nodemask which excludes the migration source: try
      to find the first node which has a page in the preallocated pool and
      fall back to __alloc_buddy_huge_page_no_mpol only when the whole pool is
      consumed.
      [akpm@linux-foundation.org: remove bogus arg from alloc_huge_page_nodemask() stub]
      Link: http://lkml.kernel.org/r/20170608074553.22152-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: zhong jiang <zhongjiang@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
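      The target selection described in this commit message (search the
      preallocated pool on every node except the migration source, and only
      then consider a fresh allocation) might be sketched as follows.  This
      is a userspace model under stated assumptions: the toy_* names, the
      fixed node count and the bitmask standing in for a nodemask are all
      hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NODES 8

/*
 * Take one preallocated page from the first allowed node that has one.
 * Returns the node id, or -1 if the pool is empty on every allowed node.
 */
static int toy_dequeue_from_pool(int free_pool[TOY_NODES], unsigned int allowed)
{
    for (int nid = 0; nid < TOY_NODES; nid++) {
        if (!(allowed & (1u << nid)))
            continue;
        if (free_pool[nid] > 0) {
            free_pool[nid]--;
            return nid;
        }
    }
    return -1;
}

/*
 * Pick a migration target for a page that lives on node src.  The mask
 * excludes src itself.  *needs_fresh is set when the preallocated pool is
 * exhausted and the caller would have to attempt a fresh (surplus)
 * allocation, which is the less reliable path.
 */
static int toy_migration_target(int free_pool[TOY_NODES], int src,
                                bool *needs_fresh)
{
    unsigned int allowed;
    int nid;

    assert(src >= 0 && src < TOY_NODES);
    allowed = ((1u << TOY_NODES) - 1) & ~(1u << src);
    nid = toy_dequeue_from_pool(free_pool, allowed);
    *needs_fresh = (nid < 0);
    return nid;
}
```

      For the scenario in the message, with spare pages on node 0 and node 4
      being offlined, the sketch picks node 0 without touching the fresh
      allocation path.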
    • mm/hugetlb.c: warn the user when issues arise on boot due to hugepages · d715cf80
      Liam R. Howlett authored
      When the user specifies too many hugepages or an invalid
      default_hugepagesz the communication to the user is implicit in the
      allocation message.  This patch adds a warning when the desired page
      count is not allocated and prints an error when the default_hugepagesz
      is invalid on boot.
      During boot hugepages will allocate until there is a fraction of the
      hugepage size left.  That is, we allocate until either the request is
      satisfied or memory for the pages is exhausted.  When memory for the
      pages is exhausted, it will most likely lead to the system failing with
      the OOM manager not finding enough (or anything) to kill (unless you're
      using really big hugepages in the order of 100s of MB or in the GBs).
      The user will most likely see the OOM messages much later in the boot
      sequence than the implicitly stated message.  Worse yet, you may even
      get an OOM for each processor which causes many pages of OOMs on modern
      systems.  Although these messages will be printed earlier than the OOM
      messages, at least giving the user errors and warnings will highlight
      the configuration as an issue.  I'm trying to point the user in the
      right direction by providing a more robust statement of what is failing.
      During the sysctl or echo command, the user can check the results much
      more easily than if the system hangs during boot, and the scenario of
      having nothing to OOM for kernel memory is highly unlikely.
      Mike said:
       "Before sending out this patch, I asked Liam off list why he was doing
        it. Was it something he just thought would be useful? Or, was there
        some type of user situation/need. He said that he had been called in
        to assist on several occasions when a system OOMed during boot. In
        almost all of these situations, the user had grossly misconfigured
        huge pages.
        DB users want to pre-allocate just the right amount of huge pages, but
        sometimes they can be really off. In such situations, the huge page
        init code just allocates as many huge pages as it can and reports the
        number allocated. There is no indication that it quit allocating
        because it ran out of memory. Of course, a user could compare the
        number in the message to what they requested on the command line to
        determine if they got all the huge pages they requested. The thought
        was that it would be useful to at least flag this situation. That way,
        the user might be able to better relate the huge page allocation
        failure to the OOM.
        I'm not sure if the e-mail discussion made it obvious that this is
        something he has seen on several occasions.
        I see Michal's point that this will only flag the situation where
        someone configures huge pages very badly. And, a more extensive look
        at the situation of misconfiguring huge pages might be in order. But,
        this has happened on several occasions which led to the creation of
        this patch"
      [akpm@linux-foundation.org: reposition memfmt() to avoid forward declaration]
      Link: http://lkml.kernel.org/r/20170603005413.10380-1-Liam.Howlett@Oracle.com
      Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: zhongjiang <zhongjiang@huawei.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: hugetlb: soft-offline: dissolve source hugepage after successful migration · c3114a84
      Anshuman Khandual authored
      Currently a hugepage migrated by soft-offline (i.e.  due to correctable
      memory errors) is kept as a hugepage, which means the many non-error
      pages in it are unreusable, i.e.  wasted.
      This patch solves this issue by dissolving source hugepages into buddy.
      As done in the previous patch, PageHWPoison is set only on the head page
      of the error hugepage.  Then in dissolving we move the PageHWPoison flag
      to the raw error page so that all healthy subpages return to buddy.
      [arnd@arndb.de: fix warnings: replace some macros with inline functions]
        Link: http://lkml.kernel.org/r/20170609102544.2947326-1-arnd@arndb.de
      Link: http://lkml.kernel.org/r/1496305019-5493-5-git-send-email-n-horiguchi@ah.jp.nec.com
      Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
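      The flag hand-off described in this commit message (poison recorded on
      the head page, then moved to the raw error page when the hugepage is
      dissolved) can be pictured with a toy model.  The struct layout, the
      separate head flag and the toy_dissolve helper are assumptions for
      illustration only and do not reflect struct page.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_SUBPAGES 8

/* Toy hugepage: one poison marker on the head plus per-subpage markers. */
struct toy_hugepage {
    bool head_poisoned;               /* poison recorded on the head page */
    int error_subpage;                /* index of the page that really failed */
    bool sub_poisoned[TOY_SUBPAGES];  /* poison state after dissolving */
};

/*
 * Dissolve the hugepage: move the poison marker from the head to the raw
 * error subpage and return how many healthy subpages can be reused.
 */
static int toy_dissolve(struct toy_hugepage *hp)
{
    int healthy = 0;

    if (hp->head_poisoned) {
        assert(hp->error_subpage >= 0 && hp->error_subpage < TOY_SUBPAGES);
        hp->head_poisoned = false;
        hp->sub_poisoned[hp->error_subpage] = true;
    }
    for (int i = 0; i < TOY_SUBPAGES; i++)
        if (!hp->sub_poisoned[i])
            healthy++;
    return healthy;
}
```

      In the model, a hugepage with a single bad subpage gives back all the
      other subpages instead of being wasted as a whole.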
    • mm: hugetlb: prevent reuse of hwpoisoned free hugepages · 243abd5b
      Naoya Horiguchi authored
      Patch series "mm: hwpoison: fixlet for hugetlb migration".
      This patchset updates the hwpoison/hugetlb code to address 2 reported
      issues.
      One is a madvise(MADV_HWPOISON) failure reported by Intel's lkp robot (see
      http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop).  The first
      half was already fixed in mainline, and the other half, about hugetlb
      cases, is solved in this series.
      The other issue is "narrow down the error affected region to a single 4kB
      page instead of a whole hugetlb page", which was first attempted by
      Anshuman; I updated it to apply more widely.
      This patch (of 9):
      We no longer use MIGRATE_ISOLATE to prevent reuse of hwpoison hugepages
      as we did before.  So current dequeue_huge_page_node() doesn't work as
      intended because it still uses is_migrate_isolate_page() for this check.
      This patch fixes it with PageHWPoison flag.
      Link: http://lkml.kernel.org/r/1496305019-5493-2-git-send-email-n-horiguchi@ah.jp.nec.com
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
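      The check this commit describes, skipping free hugepages that carry the
      poison flag when dequeuing, can be sketched in userspace as below.  The
      toy_free_page struct and toy_dequeue helper are hypothetical; the real
      code walks a kernel free list and tests PageHWPoison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy free-list entry; hwpoison stands in for the PageHWPoison flag. */
struct toy_free_page {
    int id;
    bool hwpoison;
};

/*
 * Return the first free page that is not poisoned, or NULL if every
 * candidate is poisoned (or the list is empty).
 */
static const struct toy_free_page *
toy_dequeue(const struct toy_free_page *list, size_t n)
{
    assert(list != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (!list[i].hwpoison)
            return &list[i];
    return NULL;
}
```

      A poisoned page at the front of the list is passed over, so it is never
      handed out again.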
  2. 06 Jul, 2017 9 commits
  3. 02 Jun, 2017 1 commit
  4. 01 Apr, 2017 2 commits
    • mm/hugetlb.c: don't call region_abort if region_chg fails · ff8c0c53
      Mike Kravetz authored
      Changing a hugetlbfs reservation map is a two step process.  The first
      step is a call to region_chg to determine what needs to be changed, and
      prepare that change.  This should be followed by a call to
      region_add to commit the change, or region_abort to abort the change.
      The error path in hugetlb_reserve_pages called region_abort after a
      failed call to region_chg.  As a result, the adds_in_progress counter in
      the reservation map is off by 1.  This is caught by a VM_BUG_ON in
      resv_map_release when the reservation map is freed.
      syzkaller fuzzer (when using an injected kmalloc failure) found this
      bug, which resulted in the following:
       kernel BUG at mm/hugetlb.c:742!
       Call Trace:
        hugetlbfs_evict_inode+0x7b/0xa0 fs/hugetlbfs/inode.c:493
        evict+0x481/0x920 fs/inode.c:553
        iput_final fs/inode.c:1515 [inline]
        iput+0x62b/0xa20 fs/inode.c:1542
        hugetlb_file_setup+0x593/0x9f0 fs/hugetlbfs/inode.c:1306
        newseg+0x422/0xd30 ipc/shm.c:575
        ipcget_new ipc/util.c:285 [inline]
        ipcget+0x21e/0x580 ipc/util.c:639
        SYSC_shmget ipc/shm.c:673 [inline]
        SyS_shmget+0x158/0x230 ipc/shm.c:657
       RIP: resv_map_release+0x265/0x330 mm/hugetlb.c:742
      Link: http://lkml.kernel.org/r/1490821682-23228-1-git-send-email-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
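      The two step protocol and the off-by-one described in this commit
      message can be modelled with a single counter.  This is a userspace
      sketch, not the reservation map code: the toy_* functions, their
      boolean failure switches and the assumption that a failed toy_region_chg
      leaves the counter untouched are all illustrative.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy reservation map: only the in-progress counter is modelled. */
struct toy_resv_map {
    int adds_in_progress;
};

/* Step 1: prepare a change.  On failure nothing is left pending. */
static int toy_region_chg(struct toy_resv_map *m, bool alloc_fails)
{
    if (alloc_fails)
        return -1;
    m->adds_in_progress++;
    return 0;
}

/* Step 2a: commit the prepared change. */
static void toy_region_add(struct toy_resv_map *m)
{
    m->adds_in_progress--;
}

/* Step 2b: abort the prepared change. */
static void toy_region_abort(struct toy_resv_map *m)
{
    m->adds_in_progress--;
}

/*
 * Corrected error handling: abort only what was successfully prepared.
 * Calling toy_region_abort() after a failed toy_region_chg() would drive
 * the counter to -1, which is the imbalance the commit fixes.
 */
static int toy_reserve(struct toy_resv_map *m, bool chg_fails,
                       bool later_step_fails)
{
    if (toy_region_chg(m, chg_fails) < 0)
        return -1;
    if (later_step_fails) {
        toy_region_abort(m);
        return -1;
    }
    toy_region_add(m);
    return 0;
}

/* Stand-in for the consistency check made when the map is freed. */
static void toy_release(const struct toy_resv_map *m)
{
    assert(m->adds_in_progress == 0);
}
```

      Whichever path toy_reserve takes, the counter returns to zero, so the
      check in toy_release holds.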
    • mm, hugetlb: use pte_present() instead of pmd_present() in follow_huge_pmd() · c9d398fa
      Naoya Horiguchi authored
      I found a race condition which triggers the following bug when
      move_pages() and soft offline are called on a single hugetlb page
      concurrently:
          Soft offlining page 0x119400 at 0x700000000000
          BUG: unable to handle kernel paging request at ffffea0011943820
          IP: follow_huge_pmd+0x143/0x190
          PGD 7ffd2067
          PUD 7ffd1067
          PMD 0
              [61163.582052] Oops: 0000 [#1] SMP
          Modules linked in: binfmt_misc ppdev virtio_balloon parport_pc pcspkr i2c_piix4 parport i2c_core acpi_cpufreq ip_tables xfs libcrc32c ata_generic pata_acpi virtio_blk 8139too crc32c_intel ata_piix serio_raw libata virtio_pci 8139cp virtio_ring virtio mii floppy dm_mirror dm_region_hash dm_log dm_mod [last unloaded: cap_check]
          CPU: 0 PID: 22573 Comm: iterate_numa_mo Tainted: P           OE   4.11.0-rc2-mm1+ #2
          Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
          RIP: 0010:follow_huge_pmd+0x143/0x190
          RSP: 0018:ffffc90004bdbcd0 EFLAGS: 00010202
          RAX: 0000000465003e80 RBX: ffffea0004e34d30 RCX: 00003ffffffff000
          RDX: 0000000011943800 RSI: 0000000000080001 RDI: 0000000465003e80
          RBP: ffffc90004bdbd18 R08: 0000000000000000 R09: ffff880138d34000
          R10: ffffea0004650000 R11: 0000000000c363b0 R12: ffffea0011943800
          R13: ffff8801b8d34000 R14: ffffea0000000000 R15: 000077ff80000000
          FS:  00007fc977710740(0000) GS:ffff88007dc00000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: ffffea0011943820 CR3: 000000007a746000 CR4: 00000000001406f0
          Call Trace:
          RIP: 0033:0x7fc976e03949
          RSP: 002b:00007ffe72221d88 EFLAGS: 00000246 ORIG_RAX: 0000000000000117
          RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc976e03949
          RDX: 0000000000c22390 RSI: 0000000000001400 RDI: 0000000000005827
          RBP: 00007ffe72221e00 R08: 0000000000c2c3a0 R09: 0000000000000004
          R10: 0000000000c363b0 R11: 0000000000000246 R12: 0000000000400650
          R13: 00007ffe72221ee0 R14: 0000000000000000 R15: 0000000000000000
          Code: 81 e4 ff ff 1f 00 48 21 c2 49 c1 ec 0c 48 c1 ea 0c 4c 01 e2 49 bc 00 00 00 00 00 ea ff ff 48 c1 e2 06 49 01 d4 f6 45 bc 04 74 90 <49> 8b 7c 24 20 40 f6 c7 01 75 2b 4c 89 e7 8b 47 1c 85 c0 7e 2a
          RIP: follow_huge_pmd+0x143/0x190 RSP: ffffc90004bdbcd0
          CR2: ffffea0011943820
          ---[ end trace e4f81353a2d23232 ]---
          Kernel panic - not syncing: Fatal exception
          Kernel Offset: disabled
      This bug is triggered when pmd_present() returns true for non-present
      hugetlb, so fixing the present check in follow_huge_pmd() prevents it.
      Using pmd_present() to determine present/non-present for hugetlb is not
      correct, because pmd_present() checks multiple bits (not only
      _PAGE_PRESENT) for historical reasons and it can misjudge hugetlb state.
      Fixes: e66f17ff ("mm/hugetlb: take page table lock in follow_huge_pmd()")
      Link: http://lkml.kernel.org/r/1490149898-20231-1-git-send-email-n-horiguchi@ah.jp.nec.com
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: <stable@vger.kernel.org>        [4.0+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
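      Why a "present" test that looks at several bits can misjudge an entry
      is easy to show with a toy bit layout.  The message above only states
      that pmd_present() checks multiple bits; the specific bit values and
      the toy_* predicates below are assumptions chosen for illustration
      (loosely modelled on x86) and should not be read as the kernel's
      definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits for a page table entry. */
#define TOY_PRESENT  0x001u  /* entry maps a page */
#define TOY_PSE      0x080u  /* entry describes a huge mapping */
#define TOY_PROTNONE 0x100u  /* mapped but access-protected */

/* Broad test: any of three bits counts as "present". */
static bool toy_pmd_present(uint64_t entry)
{
    return (entry & (TOY_PRESENT | TOY_PROTNONE | TOY_PSE)) != 0;
}

/* Narrower test: the huge-mapping bit alone is not enough. */
static bool toy_pte_present(uint64_t entry)
{
    return (entry & (TOY_PRESENT | TOY_PROTNONE)) != 0;
}
```

      In this model a non-present entry that still carries the huge-mapping
      bit (for example one rewritten while the page is being migrated) passes
      the broad test but fails the narrow one, so only the narrow test
      prevents the entry from being treated as a mapped page.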
  5. 09 Mar, 2017 1 commit
  6. 02 Mar, 2017 1 commit
  7. 25 Feb, 2017 2 commits
  8. 23 Feb, 2017 5 commits
  9. 11 Jan, 2017 1 commit
  10. 13 Dec, 2016 4 commits
  11. 11 Nov, 2016 1 commit
  12. 08 Oct, 2016 5 commits