  1. Nov 23, 2022
    • mm: khugepaged: allow page allocation fallback to eligible nodes · e031ff96
      Yang Shi authored
      Syzbot reported the below splat:
      
      WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 __alloc_pages_node include/linux/gfp.h:221 [inline]
      WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
      WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
      Modules linked in:
      CPU: 1 PID: 3646 Comm: syz-executor210 Not tainted 6.1.0-rc1-syzkaller-00454-ga70385240892 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022
      RIP: 0010:__alloc_pages_node include/linux/gfp.h:221 [inline]
      RIP: 0010:hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
      RIP: 0010:alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
      Code: e5 01 4c 89 ee e8 6e f9 ae ff 4d 85 ed 0f 84 28 fc ff ff e8 70 fc ae ff 48 8d 6b ff 4c 8d 63 07 e9 16 fc ff ff e8 5e fc ae ff <0f> 0b e9 96 fa ff ff 41 bc 1a 00 00 00 e9 86 fd ff ff e8 47 fc ae
      RSP: 0018:ffffc90003fdf7d8 EFLAGS: 00010293
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: ffff888077f457c0 RSI: ffffffff81cd8f42 RDI: 0000000000000001
      RBP: ffff888079388c0c R08: 0000000000000001 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
      R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000000
      FS:  00007f6b48ccf700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f6b48a819f0 CR3: 00000000171e7000 CR4: 00000000003506e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       collapse_file+0x1ca/0x5780 mm/khugepaged.c:1715
       hpage_collapse_scan_file+0xd6c/0x17a0 mm/khugepaged.c:2156
       madvise_collapse+0x53a/0xb40 mm/khugepaged.c:2611
       madvise_vma_behavior+0xd0a/0x1cc0 mm/madvise.c:1066
       madvise_walk_vmas+0x1c7/0x2b0 mm/madvise.c:1240
       do_madvise.part.0+0x24a/0x340 mm/madvise.c:1419
       do_madvise mm/madvise.c:1432 [inline]
       __do_sys_madvise mm/madvise.c:1432 [inline]
       __se_sys_madvise mm/madvise.c:1430 [inline]
       __x64_sys_madvise+0x113/0x150 mm/madvise.c:1430
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      RIP: 0033:0x7f6b48a4eef9
      Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 b1 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f6b48ccf318 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
      RAX: ffffffffffffffda RBX: 00007f6b48af0048 RCX: 00007f6b48a4eef9
      RDX: 0000000000000019 RSI: 0000000000600003 RDI: 0000000020000000
      RBP: 00007f6b48af0040 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f6b48aa53a4
      R13: 00007f6b48bffcbf R14: 00007f6b48ccf400 R15: 0000000000022000
       </TASK>
      
      The khugepaged code picks the node with the most hits as the preferred
      node, and also tries to do some balancing if several nodes have the same
      hit count.  Conceptually it does:
          * If target_node <= last_target_node, then iterate from
      last_target_node + 1 to MAX_NUMNODES (1024 on the default config)
          * If max_value == node_load[nid], then target_node = nid
      
      But there is a corner case, particularly for MADV_COLLAPSE, in which a
      non-existent node may be returned as the preferred node.
      
      Assume the system has 2 nodes, target_node is 0 and last_target_node
      is 1.  If the MADV_COLLAPSE path is hit, max_value may be 0, so 2 may
      be returned as target_node, but that node does not actually exist
      (it is offline), so the warning is triggered.
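
      The corner case can be reproduced with a standalone sketch of the
      selection logic (userspace C mirroring the description above; the
      function and array names and the 2-node setup are illustrative, not
      the kernel code verbatim):

          #include <stdio.h>

          #define MAX_NUMNODES    1024    /* default config value mentioned above */
          #define NR_ONLINE_NODES 2       /* the 2-node system from the example */

          static int node_load[MAX_NUMNODES];

          static int find_target_node(int last_target_node)
          {
              int nid, target_node = 0, max_value = 0;

              /* pick the first node with the most hits */
              for (nid = 0; nid < MAX_NUMNODES; nid++)
                  if (node_load[nid] > max_value) {
                      max_value = node_load[nid];
                      target_node = nid;
                  }

              /* balance if several nodes have the same hit record */
              if (target_node <= last_target_node)
                  for (nid = last_target_node + 1; nid < MAX_NUMNODES; nid++)
                      if (max_value == node_load[nid]) {
                          target_node = nid;
                          break;
                      }

              return target_node;
          }

          int main(void)
          {
              /* MADV_COLLAPSE corner case: no hits recorded, last target was node 1 */
              int target = find_target_node(1);

              /* prints 2, an offline node on this 2-node system, so the WARN fires */
              printf("target_node = %d (online nodes: 0..%d)\n",
                     target, NR_ONLINE_NODES - 1);
              return 0;
          }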
      
      The node balance was introduced by commit 9f1b868a ("mm: thp:
      khugepaged: add policy for finding target node") to satisfy
      "numactl --interleave=all".  But interleaving is a mere hint rather than
      something that has hard requirements.
      
      So use a nodemask to record the nodes which share the top hit count;
      the hugepage allocation can then fall back to those nodes.  Also remove
      __GFP_THISNODE, since it disallows fallback.  If the nodemask has just
      one node set, meaning a single node has the most hits, the nodemask
      approach behaves just like __GFP_THISNODE.
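
      A minimal sketch of the nodemask idea (again userspace C with
      illustrative names, not the actual kernel diff): every node sharing the
      top hit count is recorded in a mask, the preferred node stays the first
      of them, and the allocation may fall back to any node in the mask; with
      a single bit set this degenerates to the old __GFP_THISNODE behaviour.

          #include <stdint.h>
          #include <stdio.h>

          #define MAX_NUMNODES 64     /* kept small so a uint64_t can act as the nodemask */

          static int node_load[MAX_NUMNODES];

          static int find_target_node(uint64_t *alloc_nmask)
          {
              int nid, target_node = 0, max_value = 0;

              *alloc_nmask = 0;

              for (nid = 0; nid < MAX_NUMNODES; nid++)
                  if (node_load[nid] > max_value) {
                      max_value = node_load[nid];
                      target_node = nid;
                  }

              /* record every node that shares the top hit count */
              for (nid = 0; nid < MAX_NUMNODES; nid++)
                  if (max_value && node_load[nid] == max_value)
                      *alloc_nmask |= 1ULL << nid;

              return target_node;
          }

          int main(void)
          {
              uint64_t nmask;

              node_load[0] = 3;
              node_load[1] = 3;       /* two nodes tie on the hit count */

              int preferred = find_target_node(&nmask);

              /* allocation prefers node 0 but may fall back to node 1 */
              printf("preferred=%d nodemask=0x%llx\n",
                     preferred, (unsigned long long)nmask);
              return 0;
          }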
      
      Link: https://lkml.kernel.org/r/20221108184357.55614-2-shy828301@gmail.com
      
      
      Fixes: 7d8faaf1 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse")
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Suggested-by: Zach O'Keefe <zokeefe@google.com>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Zach O'Keefe <zokeefe@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reported-by: <syzbot+0044b22d177870ee974f@syzkaller.appspotmail.com>
      
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: vmscan: fix extreme overreclaim and swap floods · f53af428
      Johannes Weiner authored
      During proactive reclaim, we sometimes observe severe overreclaim, with
      several thousand times more pages reclaimed than requested.
      
      This trace was obtained from shrink_lruvec() during such an instance:
      
          prio:0 anon_cost:1141521 file_cost:7767
          nr_reclaimed:4387406 nr_to_reclaim:1047 (or_factor:4190)
          nr=[7161123 345 578 1111]
      
      While the reclaimer requested 4M, vmscan reclaimed close to 16G, most of it
      by swapping.  These requests take over a minute, during which the write()
      to memory.reclaim is unkillably stuck inside the kernel.
      
      Digging into the source, this is caused by the proportional reclaim
      bailout logic.  This code tries to resolve a fundamental conflict: to
      reclaim roughly what was requested, while also aging all LRUs fairly and
      in accordance to their size, swappiness, refault rates etc.  The way it
      attempts fairness is that once the reclaim goal has been reached, it stops
      scanning the LRUs with the smaller remaining scan targets, and adjusts the
      remainder of the bigger LRUs according to how much of the smaller LRUs was
      scanned.  It then finishes scanning that remainder regardless of the
      reclaim goal.
      
      This works fine if priority levels are low and the LRU lists are
      comparable in size.  However, in this instance, the cgroup that is
      targeted by proactive reclaim has almost no files left - they've already
      been squeezed out by proactive reclaim earlier - and the remaining anon
      pages are hot.  Anon rotations cause the priority level to drop to 0,
      which results in reclaim targeting all of anon (a lot) and all of file
      (almost nothing).  By the time reclaim decides to bail, it has scanned
      most or all of the file target, and therefore must also scan most or all of
      the enormous anon target.  This target is thousands of times larger than
      the reclaim goal, thus causing the overreclaim.
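
      A back-of-the-envelope sketch (userspace C, not shrink_lruvec() itself)
      of that arithmetic, plugging in the scan targets from the trace above:
      once the small file side has been scanned, the huge anon side is rescaled
      by the file side's remaining percentage, which is ~0%, so essentially the
      whole anon target still gets scanned even though the goal was 1047 pages.

          #include <stdio.h>

          int main(void)
          {
              unsigned long anon_target = 7161123 + 345;  /* inactive + active anon, from nr=[...] */
              unsigned long file_target = 578 + 1111;     /* inactive + active file */
              unsigned long file_remaining = 0;           /* file LRUs scanned dry when the goal is met */

              /* percentage of the smaller (file) side still left to scan */
              unsigned long percentage = file_remaining * 100 / (file_target + 1);

              /* the larger (anon) side is adjusted to match, then scanned to completion */
              unsigned long anon_to_scan = anon_target * (100 - percentage) / 100;

              printf("reclaim goal: 1047 pages, anon still to scan: %lu pages\n", anon_to_scan);
              return 0;
          }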
      
      The bailout code hasn't changed in years, why is this failing now?  The
      most likely explanations are two other recent changes in anon reclaim:
      
      1. Before the series starting with commit 5df74196 ("mm: fix LRU
         balancing effect of new transparent huge pages"), the VM was
         overall relatively reluctant to swap at all, even if swap was
         configured. This means the LRU balancing code didn't come into play
         as often as it does now, and mostly in high pressure situations
         where pronounced swap activity wouldn't be as surprising.
      
      2. For historic reasons, shrink_lruvec() loops on the scan targets of
         all LRU lists except the active anon one, meaning it would bail if
         the only remaining pages to scan were active anon - even if there
         were a lot of them.
      
         Before the series starting with commit ccc5dc67 ("mm/vmscan:
         make active/inactive ratio as 1:1 for anon lru"), most anon pages
         would live on the active LRU; the inactive one would contain only a
         handful of preselected reclaim candidates. After the series, anon
         gets aged similarly to file, and the inactive list is the default
         for new anon pages as well, making it often the much bigger list.
      
         As a result, the VM is now more likely to actually finish large
         anon targets than before.
      
      Change the code such that only one SWAP_CLUSTER_MAX-sized nudge toward the
      larger LRU lists is made before bailing out on a met reclaim goal.
      
      This fixes the extreme overreclaim problem.
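
      Conceptually (a userspace sketch of the loop shape after the change, not
      the actual diff), every pass now scans at most SWAP_CLUSTER_MAX pages per
      list and re-checks the goal, so once the goal is met the smaller list is
      dropped, the larger list gets exactly one more SWAP_CLUSTER_MAX-sized
      pass, and the loop bails instead of finishing the rescaled remainder:

          #include <stdio.h>

          #define SWAP_CLUSTER_MAX 32UL

          static unsigned long min_ul(unsigned long a, unsigned long b)
          {
              return a < b ? a : b;
          }

          int main(void)
          {
              unsigned long nr_anon = 7161123 + 345, nr_file = 578 + 1111;  /* targets from the trace */
              unsigned long nr_to_reclaim = 1047, nr_scanned = 0;

              while (nr_anon || nr_file) {
                  unsigned long batch;

                  batch = min_ul(nr_anon, SWAP_CLUSTER_MAX);
                  nr_anon -= batch;
                  nr_scanned += batch;

                  batch = min_ul(nr_file, SWAP_CLUSTER_MAX);
                  nr_file -= batch;
                  nr_scanned += batch;

                  /* assume, for illustration, that every scanned page is reclaimed */
                  if (nr_scanned < nr_to_reclaim)
                      continue;

                  /* goal met: stop the smaller list, allow one last nudge at the larger */
                  if (!nr_anon || !nr_file)
                      break;
                  nr_file = 0;        /* file is the smaller side in this scenario */
              }

              printf("scanned %lu pages for a goal of %lu\n", nr_scanned, nr_to_reclaim);
              return 0;
          }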
      
      Fairness is more subtle and harder to evaluate.  No obvious misbehavior
      was observed on the test workload, in any case.  Conceptually, fairness
      should primarily be a cumulative effect from regular, lower priority
      scans.  Once the VM is in trouble and needs to escalate scan targets to
      make forward progress, fairness needs to take a backseat.  This is also
      acknowledged by the myriad exceptions in get_scan_count().  This patch
      makes fairness decrease gradually, as it keeps fairness work static over
      increasing priority levels with growing scan targets.  This should make
      more sense - although we may have to re-visit the exact values.
      
      Link: https://lkml.kernel.org/r/20220802162811.39216-1-hannes@cmpxchg.org
      
      
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  2. Oct 13, 2022
    • highmem: fix kmap_to_page() for kmap_local_page() addresses · ef6e06b2
      Ira Weiny authored
      kmap_to_page() is used to get the page for a virtual address which may
      be kmap'ed.  Unfortunately, kmap_local_page() stores mappings in a
      thread-local array separate from kmap(), and those mappings were not
      checked by kmap_to_page().
      
      Check the kmap_local_page() mappings and return the page if found.
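
      The shape of that lookup is roughly the following (purely illustrative
      userspace sketch; local_map, nr_local_maps and the opaque page handle
      are hypothetical stand-ins for the thread-local kmap_local bookkeeping,
      not the kernel's data structures):

          #include <stdint.h>

          #define PAGE_SHIFT 12
          #define PAGE_SIZE  (1UL << PAGE_SHIFT)
          #define MAX_LOCAL  16

          struct local_mapping {
              uintptr_t vaddr;    /* page-aligned mapped address */
              void *page;         /* opaque handle for the backing page */
          };

          /* one array per thread, mirroring the thread-local bookkeeping */
          static __thread struct local_mapping local_map[MAX_LOCAL];
          static __thread int nr_local_maps;

          static void *local_addr_to_page(const void *addr)
          {
              uintptr_t base = (uintptr_t)addr & ~(PAGE_SIZE - 1);

              for (int i = 0; i < nr_local_maps; i++)
                  if (local_map[i].vaddr == base)
                      return local_map[i].page;

              return 0;   /* not a local mapping; fall through to the other checks */
          }

          int main(void)
          {
              static char page[PAGE_SIZE] __attribute__((aligned(PAGE_SIZE)));

              local_map[nr_local_maps++] =
                  (struct local_mapping){ (uintptr_t)page, page };

              return local_addr_to_page(page + 10) == page ? 0 : 1;
          }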
      
      Because kmap_to_page() is intended to be removed, add a warn-on-once to
      the kmap checks to flag potential issues early.
      
      NOTE: Due to the 32-bit x86 use of kmap_local in iomap atomic, KMAP_LOCAL
      does not require HIGHMEM to be set.  Therefore the support calls required
      a new KMAP_LOCAL section to fix 0day build errors.
      
      [akpm@linux-foundation.org: fix warning]
      Link: https://lkml.kernel.org/r/20221006040555.1502679-1-ira.weiny@intel.com
      
      
      Signed-off-by: Ira Weiny <ira.weiny@intel.com>
      Reported-by: Al Viro <viro@zeniv.linux.org.uk>
      Reported-by: kernel test robot <lkp@intel.com>
      Cc: "Fabio M. De Francesco" <fmdefrancesco@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/page_alloc: fix incorrect PGFREE and PGALLOC for high-order page · 15cd9004
      Yafang Shao authored
      PGFREE and PGALLOC represent the number of freed and allocated pages.  So
      the page order must be considered.
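
      For illustration (plain C, not the kernel counters themselves): a single
      order-3 allocation hands out 1 << 3 = 8 pages, so a per-page event
      counter must be bumped by 1 << order rather than by 1 per call.

          #include <stdio.h>

          int main(void)
          {
              unsigned long pgalloc = 0;
              unsigned int order = 3;         /* one order-3 allocation covers 8 pages */

              pgalloc += 1UL << order;        /* count pages, not allocation calls */
              printf("PGALLOC is now %lu\n", pgalloc);
              return 0;
          }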
      
      Link: https://lkml.kernel.org/r/20221006101540.40686-1-laoar.shao@gmail.com
      
      
      Fixes: 44042b44 ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/hugetlb: use hugetlb_pte_stable in migration race check · f9bf6c03
      Peter Xu authored
      Now that hugetlb_pte_stable() has been introduced, we can also rewrite
      the migration race check against page allocation to use the new helper.
      
      Link: https://lkml.kernel.org/r/20221004193400.110155-3-peterx@redhat.com
      
      
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/hugetlb: fix race condition of uffd missing/minor handling · 2ea7ff1e
      Peter Xu authored
      Patch series "mm/hugetlb: Fix selftest failures with write check", v3.
      
      Currently akpm's mm-unstable randomly fails the uffd hugetlb private
      mapping test on a write check.
      
      The initial bisection pointed to the recent pmd unshare series, but it
      turns out there is no direct relationship with that series; it merely
      changed the timing enough for the race to start triggering.
      
      The race is fixed in patch 1.  Patch 2 is a trivial cleanup of the
      similar race with hugetlb migrations, and patch 3 comments on the write
      check so that when anyone reads it again it will be clear why it is there.
      
      
      This patch (of 3):
      
      After the recent rework patchset of hugetlb locking on pmd sharing,
      kselftest for userfaultfd sometimes fails on hugetlb private tests with
      unexpected write fault checks.
      
      It turns out there is nothing wrong within the locking series regarding
      this matter, but it may have changed the timing of the threads enough
      that an old bug can now trigger.
      
      The real bug is that when we call hugetlb_no_page() we do not hold the
      pgtable lock, which means we are reading the pte values locklessly.  That
      is perfectly fine in most cases because before we do normal page
      allocations we will take the lock and check pte_same() again.  However,
      before that there are two paths in userfaultfd missing/minor handling
      that may move on with the fault process directly without checking the
      pte values.
      
      It means for these two paths we may be generating an uffd message based on
      an unstable pte, while an unstable pte can legally be anything as long as
      the modifier holds the pgtable lock.
      
      One example, which is also what happened in the failing kselftest and
      caused the test failure, is that for private mappings write-protection
      changes can happen on one page.  Since hugetlb_change_protection()
      generally requires the pte to be cleared before being changed, there can
      be a race condition like:
      
              thread 1                              thread 2
              --------                              --------
      
            UFFDIO_WRITEPROTECT                     hugetlb_fault
              hugetlb_change_protection
                pgtable_lock()
                huge_ptep_modify_prot_start
                                                    pte==NULL
                                                    hugetlb_no_page
                                                      generate uffd missing event
                                                      even if page existed!!
                huge_ptep_modify_prot_commit
                pgtable_unlock()
      
      Fix this by rechecking the pte after pgtable lock for both userfaultfd
      missing & minor fault paths.
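
      The shape of that recheck, sketched in userspace with a pthread mutex
      standing in for the pgtable lock (illustrative names; this shows the
      idea behind the fix, not the patch itself):

          #include <pthread.h>
          #include <stdbool.h>

          static pthread_mutex_t ptl = PTHREAD_MUTEX_INITIALIZER;  /* stands in for the pgtable lock */
          static unsigned long pte;                                /* stands in for the pte slot */

          /* Is the locklessly sampled value still what the page table holds? */
          static bool pte_stable(unsigned long old_pte)
          {
              bool same;

              pthread_mutex_lock(&ptl);
              same = (pte == old_pte);
              pthread_mutex_unlock(&ptl);
              return same;
          }

          /*
           * Missing/minor fault path after the fix: only report to userspace
           * when the lockless read is confirmed under the lock, otherwise
           * retry the fault instead of raising a bogus MISSING event.
           */
          static int handle_uffd_fault(unsigned long old_pte)
          {
              if (!pte_stable(old_pte))
                  return 1;   /* pte changed under us: retry */
              /* ... generate the userfaultfd missing/minor event here ... */
              return 0;
          }

          int main(void)
          {
              return handle_uffd_fault(pte);  /* trivially stable in this single-threaded demo */
          }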
      
      This bug should have been present ever since uffd hugetlb support was
      introduced, so attach a Fixes tag to that commit.  Also attach another
      Fixes tag to the minor fault support commit for easier tracking.
      
      Note that userfaultfd is actually fine with false positives (e.g. caused
      by a pte that changed), but not with wrong logical events (e.g. caused by
      reading a pte while it is being changed).  The latter can confuse
      userspace, so the strictness is very much preferred.  E.g., a MISSING
      event should never happen on a page after UFFDIO_COPY has correctly
      installed the page and returned.
      
      Link: https://lkml.kernel.org/r/20221004193400.110155-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20221004193400.110155-2-peterx@redhat.com
      
      
      Fixes: 1a1aad8a ("userfaultfd: hugetlbfs: add userfaultfd hugetlb hook")
      Fixes: 7677f7fd ("userfaultfd: add minor fault registration mode")
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Co-developed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: use update_mmu_tlb() on the second thread · bce8cb3c
      Qi Zheng authored
      As the message of commit 7df67697 ("mm/memory.c: Update local TLB if PTE
      entry exists") said, we should only update the local TLB on the second
      thread.  So in do_anonymous_page(), use update_mmu_tlb() instead of
      update_mmu_cache() on the second thread.
      
      As David pointed out, this is a performance improvement, not a
      correctness fix.
      
      Link: https://lkml.kernel.org/r/20220929112318.32393-2-zhengqi.arch@bytedance.com
      
      
      Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Bibo Mao <maobibo@loongson.cn>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Huacai Chen <chenhuacai@loongson.cn>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • kasan: fix array-bounds warnings in tests · d6e5040b
      Andrey Konovalov authored
      GCC's -Warray-bounds option detects out-of-bounds accesses to
      statically-sized allocations in krealloc out-of-bounds tests.
      
      Use OPTIMIZER_HIDE_VAR to suppress the warning.
      
      Also change kmalloc_memmove_invalid_size to use OPTIMIZER_HIDE_VAR
      instead of a volatile variable.
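
      Userspace illustration of the technique (the macro below is a stand-in
      modelled on the kernel's OPTIMIZER_HIDE_VAR: an empty asm that takes the
      variable as a register operand, so the optimizer can no longer track the
      allocation bound behind the deliberate out-of-bounds test access):

          #include <stdlib.h>

          /* stand-in: empty asm with the variable as an in/out register operand */
          #define OPTIMIZER_HIDE_VAR(var) __asm__ volatile("" : "+r" (var))

          int main(void)
          {
              volatile char sink;
              char *ptr = malloc(16);
              size_t size = 16;

              if (!ptr)
                  return 1;

              OPTIMIZER_HIDE_VAR(ptr);    /* compiler loses track of the 16-byte bound */
              sink = ptr[size];           /* deliberate OOB read, as in the krealloc tests */
              (void)sink;

              free(ptr);
              return 0;
          }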
      
      Link: https://lkml.kernel.org/r/e94399242d32e00bba6fd0d9ec4c897f188128e8.1664215688.git.andreyknvl@google.com
      
      
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Reported-by: kernel test robot <lkp@intel.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/migrate_device.c: add migrate_device_range() · e778406b
      Alistair Popple authored
      Device drivers can use the migrate_vma family of functions to migrate
      existing private anonymous mappings to device private pages.  These pages
      are backed by memory on the device with drivers being responsible for
      copying data to and from device memory.
      
      Device private pages are freed via the pgmap->page_free() callback when
      they are unmapped and their refcount drops to zero.  Alternatively they
      may be freed indirectly via migration back to CPU memory in response to a
      pgmap->migrate_to_ram() callback called whenever the CPU accesses an
      address mapped to a device private page.
      
      In other words drivers cannot control the lifetime of data allocated on
      the devices and must wait until these pages are freed from userspace. 
      This causes issues when memory needs to be reclaimed on the device, either
      because the device is going away due to a ->release() callback or because
      another user needs to use the memory.
      
      Drivers could use the existing migrate_vma functions to migrate data off
      the device.  However this would require them to track the mappings of each
      page which is both complicated and not always possible.  Instead drivers
      need to be able to migrate device pages directly so they can free up
      device memory.
      
      To allow that, this patch introduces the migrate_device family of
      functions, which are functionally similar to migrate_vma but skip the
      initial lookup based on mapping.
      
      Link: https://lkml.kernel.org/r/868116aab70b0c8ee467d62498bb2cf0ef907295.1664366292.git-series.apopple@nvidia.com
      
      
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Lyude Paul <lyude@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/migrate_device.c: refactor migrate_vma and migrate_device_coherent_page() · 241f6885
      Alistair Popple authored
      migrate_device_coherent_page() reuses the existing migrate_vma family of
      functions to migrate a specific page without providing a valid mapping or
      vma.  This looks a bit odd because it means we are calling
      migrate_vma_*() without setting a valid vma; however, it was considered
      acceptable at the time because the details were internal to
      migrate_device.c and there was only a single user.
      
      One of the reasons the details could be kept internal was that this was
      strictly for migrating device coherent memory.  Such memory can be copied
      directly by the CPU without intervention from a driver.  However this
      isn't true for device private memory, and a future change requires similar
      functionality for device private memory.  So refactor the code into
      something more sensible for migrating device memory without a vma.
      
      Link: https://lkml.kernel.org/r/c7b2ff84e9b33d022cf4a40f87d051f281a16d8f.1664366292.git-series.apopple@nvidia.com
      
      
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Lyude Paul <lyude@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/memremap.c: take a pgmap reference on page allocation · 0dc45ca1
      Alistair Popple authored
      ZONE_DEVICE pages have a struct dev_pagemap which is allocated by a
      driver.  When the struct page is first allocated by the kernel in
      memremap_pages() a reference is taken on the associated pagemap to ensure
      it is not freed prior to the pages being freed.
      
      Prior to 27674ef6 ("mm: remove the extra ZONE_DEVICE struct page
      refcount") pages were considered free and returned to the driver when the
      reference count dropped to one.  However the pagemap reference was not
      dropped until the page reference count hit zero.  This would occur as part
      of the final put_page() in memunmap_pages() which would wait for all pages
      to be freed prior to returning.
      
      When the extra refcount was removed the pagemap reference was no longer
      being dropped in put_page().  Instead memunmap_pages() was changed to
      explicitly drop the pagemap references.  This means that memunmap_pages()
      can complete even though pages are still mapped by the kernel which can
      lead to kernel crashes, particularly if a driver frees the pagemap.
      
      To fix this drivers should take a pagemap reference when allocating the
      page.  This reference can then be returned when the page is freed.
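
      The lifetime rule can be pictured with a small refcounting sketch
      (userspace C with illustrative names, not the kernel API): the
      allocation path takes a reference on the owning pagemap and the free
      path returns it, so the pagemap cannot be torn down while any of its
      pages are still live.

          #include <assert.h>
          #include <stdatomic.h>

          struct pagemap {
              atomic_int refs;            /* baseline reference held by the driver */
          };

          struct dev_page {
              struct pagemap *pgmap;
          };

          static void dev_page_init(struct dev_page *page, struct pagemap *pgmap)
          {
              atomic_fetch_add(&pgmap->refs, 1);       /* taken when the page is handed out */
              page->pgmap = pgmap;
          }

          static void dev_page_free(struct dev_page *page)
          {
              atomic_fetch_sub(&page->pgmap->refs, 1); /* returned when the page is freed */
          }

          int main(void)
          {
              struct pagemap pgmap = { .refs = 1 };
              struct dev_page page;

              dev_page_init(&page, &pgmap);
              assert(atomic_load(&pgmap.refs) == 2);   /* teardown must wait for this to drop */
              dev_page_free(&page);
              assert(atomic_load(&pgmap.refs) == 1);
              return 0;
          }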
      
      Link: https://lkml.kernel.org/r/12d155ec727935ebfbb4d639a03ab374917ea51b.1664366292.git-series.apopple@nvidia.com
      
      
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Fixes: 27674ef6 ("mm: remove the extra ZONE_DEVICE struct page refcount")
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Lyude Paul <lyude@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>