  1. Jul 24, 2024
    • sysctl: treewide: constify the ctl_table argument of proc_handlers · 78eb4ea2
      Joel Granados authored
      
      Const-qualify the struct ctl_table argument in the proc_handler function
      signatures.  This is a prerequisite to moving the static ctl_table
      structs into .rodata, which will ensure that proc_handler function
      pointers cannot be modified.
      
      This patch has been generated by the following coccinelle script:
      
      ```
        virtual patch
      
        @r1@
        identifier ctl, write, buffer, lenp, ppos;
        identifier func !~ "appldata_(timer|interval)_handler|sched_(rt|rr)_handler|rds_tcp_skbuf_handler|proc_sctp_do_(hmac_alg|rto_min|rto_max|udp_port|alpha_beta|auth|probe_interval)";
        @@
      
        int func(
        - struct ctl_table *ctl
        + const struct ctl_table *ctl
          ,int write, void *buffer, size_t *lenp, loff_t *ppos);
      
        @r2@
        identifier func, ctl, write, buffer, lenp, ppos;
        @@
      
        int func(
        - struct ctl_table *ctl
        + const struct ctl_table *ctl
          ,int write, void *buffer, size_t *lenp, loff_t *ppos)
        { ... }
      
        @r3@
        identifier func;
        @@
      
        int func(
        - struct ctl_table *
        + const struct ctl_table *
          ,int , void *, size_t *, loff_t *);
      
        @r4@
        identifier func, ctl;
        @@
      
        int func(
        - struct ctl_table *ctl
        + const struct ctl_table *ctl
          ,int , void *, size_t *, loff_t *);
      
        @r5@
        identifier func, write, buffer, lenp, ppos;
        @@
      
        int func(
        - struct ctl_table *
        + const struct ctl_table *
          ,int write, void *buffer, size_t *lenp, loff_t *ppos);
      
      ```
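
      For illustration, this is roughly what a converted handler looks like after
      the script runs (a hypothetical handler, not taken from the patch;
      proc_dointvec() is itself a proc_handler, so it receives the same const
      qualifier):

      ```
        static int example_dointvec(const struct ctl_table *table, int write,
                                    void *buffer, size_t *lenp, loff_t *ppos)
        {
                /* the table entry can be read but no longer modified here */
                return proc_dointvec(table, write, buffer, lenp, ppos);
        }
      ```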
      
      * Code formatting was adjusted in xfs_sysctl.c to comply with code
        conventions. The xfs_stats_clear_proc_handler,
        xfs_panic_mask_proc_handler and xfs_deprecated_dointvec_minmax were
        adjusted.
      
      * The ctl_table argument in proc_watchdog_common was const qualified.
        This is called from a proc_handler itself and is calling back into
        another proc_handler, making it necessary to change it as part of the
        proc_handler migration.
      
      Co-developed-by: Thomas Weißschuh <linux@weissschuh.net>
      Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
      Co-developed-by: Joel Granados <j.granados@samsung.com>
      Signed-off-by: Joel Granados <j.granados@samsung.com>
      78eb4ea2
  2. Jul 19, 2024
    • mm: add MAP_DROPPABLE for designating always lazily freeable mappings · 9651fced
      Jason A. Donenfeld authored
      
      The vDSO getrandom() implementation works with a buffer allocated with a
      new system call that has certain requirements:
      
      - It shouldn't be written to core dumps.
        * Easy: VM_DONTDUMP.
      - It should be zeroed on fork.
        * Easy: VM_WIPEONFORK.
      
      - It shouldn't be written to swap.
        * Uh-oh: mlock is rlimited.
        * Uh-oh: mlock isn't inherited by forks.
      
      - It shouldn't reserve actual memory, but it also shouldn't crash when
        page faulting in memory if none is available.
        * Uh-oh: VM_NORESERVE means segfaults.
      
      It turns out that the vDSO getrandom() function has three really nice
      characteristics that we can exploit to solve this problem:
      
      1) Due to being wiped during fork(), the vDSO code is already robust to
         having the contents of the pages it reads zeroed out midway through
         the function's execution.
      
      2) In the absolute worst case of whatever contingency we're coding for,
         we have the option to fall back to the getrandom() syscall, and
         everything is fine.
      
      3) The buffers the function uses are only ever useful for a maximum of
         60 seconds -- a sort of cache, rather than a long term allocation.
      
      These characteristics mean that we can introduce VM_DROPPABLE, which
      has the following semantics:
      
      a) It never is written out to swap.
      b) Under memory pressure, mm can just drop the pages (so that they're
         zero when read back again).
      c) It is inherited by fork.
      d) It doesn't count against the mlock budget, since nothing is locked.
      e) If there's not enough memory to service a page fault, it's not fatal,
         and no signal is sent.
      
      This way, allocations used by vDSO getrandom() can use:
      
          VM_DROPPABLE | VM_DONTDUMP | VM_WIPEONFORK | VM_NORESERVE
      
      And there will be no problem with OOMing, crashing on overcommitment,
      using memory when not in use, not wiping on fork(), coredumps, or
      writing out to swap.
      
      In order to let vDSO getrandom() use this, expose these via mmap(2) as
      MAP_DROPPABLE.
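
      A minimal userspace sketch of creating such a mapping (the fallback
      #define mirrors the value added to the uapi header by this patch; treat
      it as an assumption if your headers differ, and note the mmap() fails on
      kernels without this support):

      ```
        #include <stdio.h>
        #include <sys/mman.h>

        #ifndef MAP_DROPPABLE
        #define MAP_DROPPABLE 0x08   /* assumed uapi value; shipped in <linux/mman.h> on new kernels */
        #endif

        int main(void)
        {
                /* Anonymous mapping whose pages the kernel may drop (and read
                 * back as zeros) under memory pressure -- only suitable for
                 * data that can be regenerated, like the getrandom() state. */
                void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                               MAP_ANONYMOUS | MAP_PRIVATE | MAP_DROPPABLE, -1, 0);
                if (p == MAP_FAILED) {
                        perror("mmap(MAP_DROPPABLE)");
                        return 1;
                }
                munmap(p, 4096);
                return 0;
        }
      ```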
      
      Note that this involves removing the MADV_FREE special case from
      sort_folio(), which according to Yu Zhao is unnecessary and will simply
      result in an extra call to shrink_folio_list() in the worst case. The
      chunk removed reenables the swapbacked flag, which we don't want for
      VM_DROPPABLE, and we can't conditionalize it here because there isn't a
      vma reference available.
      
      Finally, the provided self test ensures that this is working as desired.
      
      Cc: linux-mm@kvack.org
      Acked-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      9651fced
  3. Jul 18, 2024
    • mm/mglru: fix ineffective protection calculation · 30d77b7e
      Yu Zhao authored
      mem_cgroup_calculate_protection() is not stateless and should only be used
      as part of a top-down tree traversal.  shrink_one() traverses the per-node
      memcg LRU instead of the root_mem_cgroup tree, and therefore it should not
      call mem_cgroup_calculate_protection().
      
      The existing misuse in shrink_one() can cause ineffective protection of
      sub-trees that are grandchildren of root_mem_cgroup.  Fix it by reusing
      lru_gen_age_node(), which already traverses the root_mem_cgroup tree, to
      calculate the protection.
      
      Previously, lru_gen_age_node() opportunistically skipped the first pass,
      i.e., when scan_control->priority was DEF_PRIORITY.  On the second pass,
      lruvec_is_sizable() uses appropriate scan_control->priority, set by
      set_initial_priority() from lru_gen_shrink_node(), to decide whether a
      memcg is too small to reclaim from.
      
      Now lru_gen_age_node() unconditionally traverses the root_mem_cgroup tree.
      So it should call set_initial_priority() upfront, to make sure
      lruvec_is_sizable() uses appropriate scan_control->priority on the first
      pass.  Otherwise, lruvec_is_reclaimable() can return false negatives and
      result in premature OOM kills when min_ttl_ms is used.
      
      Link: https://lkml.kernel.org/r/20240712232956.1427127-1-yuzhao@google.com
      
      
      Fixes: e4dde56c ("mm: multi-gen LRU: per-node lru_gen_folio lists")
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Reported-by: T.J. Mercier <tjmercier@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      30d77b7e
    • mm/zswap: fix a white space issue · b749cb0d
      Dan Carpenter authored
      We accidentally deleted a tab in commit f84152e9efc5 ("mm/zswap: use only
      one pool in zswap").  Add it back.
      
      Link: https://lkml.kernel.org/r/c15066a0-f061-42c9-b0f5-d60281d3d5d8@stanley.mountain
      
      
      Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
      Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b749cb0d
    • mm/hugetlb: fix kernel NULL pointer dereference when migrating hugetlb folio · 1390a333
      Miaohe Lin authored
      A kernel crash was observed when migrating hugetlb folio:
      
      BUG: kernel NULL pointer dereference, address: 0000000000000008
      PGD 0 P4D 0
      Oops: Oops: 0002 [#1] PREEMPT SMP NOPTI
      CPU: 0 PID: 3435 Comm: bash Not tainted 6.10.0-rc6-00450-g8578ca01f21f #66
      RIP: 0010:__folio_undo_large_rmappable+0x70/0xb0
      RSP: 0018:ffffb165c98a7b38 EFLAGS: 00000097
      RAX: fffffbbc44528090 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: ffffa30e000a2800 RSI: 0000000000000246 RDI: ffffa3153ffffcc0
      RBP: fffffbbc44528000 R08: 0000000000002371 R09: ffffffffbe4e5868
      R10: 0000000000000001 R11: 0000000000000001 R12: ffffa3153ffffcc0
      R13: fffffbbc44468000 R14: 0000000000000001 R15: 0000000000000001
      FS:  00007f5b3a716740(0000) GS:ffffa3151fc00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000008 CR3: 000000010959a000 CR4: 00000000000006f0
      Call Trace:
       <TASK>
       __folio_migrate_mapping+0x59e/0x950
       __migrate_folio.constprop.0+0x5f/0x120
       move_to_new_folio+0xfd/0x250
       migrate_pages+0x383/0xd70
       soft_offline_page+0x2ab/0x7f0
       soft_offline_page_store+0x52/0x90
       kernfs_fop_write_iter+0x12c/0x1d0
       vfs_write+0x380/0x540
       ksys_write+0x64/0xe0
       do_syscall_64+0xb9/0x1d0
       entry_SYSCALL_64_after_hwframe+0x77/0x7f
      RIP: 0033:0x7f5b3a514887
      RSP: 002b:00007ffe138fce68 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007f5b3a514887
      RDX: 000000000000000c RSI: 0000556ab809ee10 RDI: 0000000000000001
      RBP: 0000556ab809ee10 R08: 00007f5b3a5d1460 R09: 000000007fffffff
      R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
      R13: 00007f5b3a61b780 R14: 00007f5b3a617600 R15: 00007f5b3a616a00
      
      This happens because a hugetlb folio is unexpectedly passed to
      __folio_undo_large_rmappable().  The large_rmappable flag has been
      inadvertently set on hugetlb folios since commit f6a8dd98 ("hugetlb:
      convert alloc_buddy_hugetlb_folio to use a folio").  Then commit
      be9581ea ("mm: fix crashes from deferred split racing folio migration")
      made folio_migrate_mapping() call folio_undo_large_rmappable(),
      triggering the bug.  Fix this issue by clearing the large_rmappable flag
      for hugetlb folios.  They don't need that flag set anyway.
      
      Link: https://lkml.kernel.org/r/20240709120433.4136700-1-linmiaohe@huawei.com
      
      
      Fixes: f6a8dd98 ("hugetlb: convert alloc_buddy_hugetlb_folio to use a folio")
      Fixes: be9581ea ("mm: fix crashes from deferred split racing folio migration")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1390a333
    • mm/hugetlb: fix possible recursive locking detected warning · 667574e8
      Miaohe Lin authored
      When trying to demote 1G hugetlb folios, a lockdep warning is observed:
      
      ============================================
      WARNING: possible recursive locking detected
      6.10.0-rc6-00452-ga4d0275fa660-dirty #79 Not tainted
      --------------------------------------------
      bash/710 is trying to acquire lock:
      ffffffff8f0a7850 (&h->resize_lock){+.+.}-{3:3}, at: demote_store+0x244/0x460
      
      but task is already holding lock:
      ffffffff8f0a6f48 (&h->resize_lock){+.+.}-{3:3}, at: demote_store+0xae/0x460
      
      other info that might help us debug this:
       Possible unsafe locking scenario:
      
             CPU0
             ----
        lock(&h->resize_lock);
        lock(&h->resize_lock);
      
       *** DEADLOCK ***
      
       May be due to missing lock nesting notation
      
      4 locks held by bash/710:
       #0: ffff8f118439c3f0 (sb_writers#5){.+.+}-{0:0}, at: ksys_write+0x64/0xe0
       #1: ffff8f11893b9e88 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0xf8/0x1d0
       #2: ffff8f1183dc4428 (kn->active#98){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x100/0x1d0
       #3: ffffffff8f0a6f48 (&h->resize_lock){+.+.}-{3:3}, at: demote_store+0xae/0x460
      
      stack backtrace:
      CPU: 3 PID: 710 Comm: bash Not tainted 6.10.0-rc6-00452-ga4d0275fa660-dirty #79
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      Call Trace:
       <TASK>
       dump_stack_lvl+0x68/0xa0
       __lock_acquire+0x10f2/0x1ca0
       lock_acquire+0xbe/0x2d0
       __mutex_lock+0x6d/0x400
       demote_store+0x244/0x460
       kernfs_fop_write_iter+0x12c/0x1d0
       vfs_write+0x380/0x540
       ksys_write+0x64/0xe0
       do_syscall_64+0xb9/0x1d0
       entry_SYSCALL_64_after_hwframe+0x77/0x7f
      RIP: 0033:0x7fa61db14887
      RSP: 002b:00007ffc56c48358 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fa61db14887
      RDX: 0000000000000002 RSI: 000055a030050220 RDI: 0000000000000001
      RBP: 000055a030050220 R08: 00007fa61dbd1460 R09: 000000007fffffff
      R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
      R13: 00007fa61dc1b780 R14: 00007fa61dc17600 R15: 00007fa61dc16a00
       </TASK>
      
      Lockdep considers this an AA deadlock because the different resize_lock
      mutexes reside in the same lockdep class, but this is a false positive.
      Place them in distinct classes to avoid these warnings.
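
      As a generic illustration of the idea (not the actual hugetlb change),
      giving each mutex its own lock_class_key keeps lockdep from treating
      nested acquisition of two different locks as recursion on a single lock:

      ```
        static struct mutex resize_locks[2];
        static struct lock_class_key resize_lock_keys[2];

        static void init_locks(void)
        {
                int i;

                for (i = 0; i < 2; i++) {
                        mutex_init(&resize_locks[i]);
                        /* distinct key => distinct lockdep class per lock */
                        lockdep_set_class(&resize_locks[i], &resize_lock_keys[i]);
                }
        }
      ```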
      
      Link: https://lkml.kernel.org/r/20240712031314.2570452-1-linmiaohe@huawei.com
      
      
      Fixes: 8531fc6f ("hugetlb: add hugetlb demote page support")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Muchun Song <muchun.song@linux.dev>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      667574e8
    • mm/gup: clear the LRU flag of a page before adding to LRU batch · 33dfe920
      yangge authored
      If a large amount of CMA memory is configured in the system (for
      example, CMA memory accounts for 50% of system memory), starting a
      virtual machine with device passthrough will call
      pin_user_pages_remote(..., FOLL_LONGTERM, ...) to pin memory.  Normally,
      if a page is present and in the CMA area, pin_user_pages_remote() will
      migrate the page from the CMA area to a non-CMA area because of the
      FOLL_LONGTERM flag.  But the current code can cause the migration to
      fail due to unexpected page refcounts, eventually causing the virtual
      machine to fail to start.
      
      When a page is added to an LRU batch, its refcount increases by one;
      removing it from the batch decreases it by one.  Page migration requires
      that a page not be referenced by anything except its page mapping.
      Before migrating a page, we should try to drain it from the LRU batch in
      case it is sitting there; however, folio_test_lru() is not sufficient to
      tell whether a page is in an LRU batch, and if it is, the migration will
      fail.
      
      To solve this, modify the logic for adding a page to an LRU batch:
      clear the page's LRU flag before adding it to the batch, so that
      folio_test_lru() can be used to check whether the page is in an LRU
      batch.  This is quite valuable, because we likely don't want to blindly
      drain the LRU batches simply because there is some unexpected reference
      on a page, as described above.
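
      Schematically, the new ordering can be pictured like this (a sketch, not
      the literal diff; folio_test_clear_lru() and folio_batch_add() are
      existing helpers):

      ```
        /* Take the LRU flag *before* the folio goes onto a per-CPU batch, so
         * a folio that still has the LRU flag set is guaranteed not to be
         * sitting in anybody's LRU batch. */
        if (folio_test_clear_lru(folio))
                folio_batch_add(fbatch, folio);
      ```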
      
      This change makes a page's LRU flag invisible for longer, which may
      impact some programs.  For example, as long as a page is in an LRU
      batch, we cannot isolate it, and we cannot check whether it is an LRU
      page.  Further, a page can now only be in exactly one LRU batch.  This
      doesn't seem to matter much, because when a new page is allocated from
      the buddy allocator and added to an LRU batch, or is isolated, its LRU
      flag may also be invisible for a long time.
      
      Link: https://lkml.kernel.org/r/1720075944-27201-1-git-send-email-yangge1116@126.com
      Link: https://lkml.kernel.org/r/1720008153-16035-1-git-send-email-yangge1116@126.com
      
      
      Fixes: 9a4e9f3b ("mm: update get_user_pages_longterm to migrate pages allocated from CMA region")
      Signed-off-by: yangge <yangge1116@126.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      33dfe920
    • mm/numa_balancing: teach mpol_to_str about the balancing mode · af649773
      Tvrtko Ursulin authored
      Since balancing mode was added in bda420b9 ("numa balancing: migrate
      on fault among multiple bound nodes"), it was possible to set this mode
      but it wouldn't be shown in /proc/<pid>/numa_maps since there was no
      support for it in the mpol_to_str() helper.
      
      Furthermore, because the balancing mode sets the MPOL_F_MORON flag, it
      would be displayed as 'default' due to a workaround introduced a few years
      earlier in 8790c71a ("mm/mempolicy.c: fix mempolicy printing in
      numa_maps").
      
      To tidy this up we implement two changes:
      
      First, replace the MPOL_F_MORON check with a pointer comparison against
      the preferred_node_policy array.  By doing this we generalise the current
      special casing and replace the incorrect 'default' with the correct 'bind'
      for the mode.
      
      Secondly, we add a string representation and corresponding handling for
      the MPOL_F_NUMA_BALANCING flag.
      
      With the two changes together we start showing the balancing flag when it
      is set and therefore complete the fix.
      
      The chosen representation separates multiple flags with vertical bars,
      following what existed long ago in kernel 2.6.25.  But since there has
      been no way to display multiple flags between then and now, this patch
      does not change the format in practice.
      
      Some /proc/<pid>/numa_maps output examples:
      
       555559580000 bind=balancing:0-1,3 file=...
       555585800000 bind=balancing|static:0,2 file=...
       555635240000 prefer=relative:0 file=
      
      Link: https://lkml.kernel.org/r/20240708075632.95857-1-tursulin@igalia.com
      
      
      Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
      Fixes: bda420b9 ("numa balancing: migrate on fault among multiple bound nodes")
      References: 8790c71a ("mm/mempolicy.c: fix mempolicy printing in numa_maps")
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>	[5.12+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      af649773
    • mm: memcg1: convert charge move flags to unsigned long long · 5316b497
      Roman Gushchin authored
      Currently the MOVE_ANON and MOVE_FILE flags are defined as integers,
      which leads to the following Smatch static checker warning:
          mm/memcontrol-v1.c:609 mem_cgroup_move_charge_write()
          warn: was expecting a 64 bit value instead of '~(1 | 2)'
      
      Fix this by redefining them as unsigned long long.
      
      Even though the issue allows setting the high 32 bits of mc.flags to an
      arbitrary value, these bits are never used, so it has no significant
      consequences.
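
      A sketch of the type change described above (names as quoted in the
      warning; the exact definitions in mm/memcontrol-v1.c may differ):

      ```
        /* previously plain integer constants (1 and 2), per the Smatch warning */
        #define MOVE_ANON       0x1ULL
        #define MOVE_FILE       0x2ULL
        #define MOVE_MASK       (MOVE_ANON | MOVE_FILE)
      ```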
      
      Link: https://lkml.kernel.org/r/ZpF8Q9zBsIY7d2P9@google.com
      
      
      Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
      Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5316b497
    • mm/mglru: fix overshooting shrinker memory · 3f74e6bd
      Yu Zhao authored
      set_initial_priority() tries to jump-start global reclaim by estimating
      the priority based on cold/hot LRU pages.  The estimation does not
      account for shrinker objects, and it cannot do so because their sizes
      can be in units other than pages.
      
      If shrinker objects are the majority, e.g., on TrueNAS SCALE 24.04.0 where
      ZFS ARC can use almost all system memory, set_initial_priority() can
      vastly underestimate how much memory the ARC shrinker can evict and assign
      extremely low values to scan_control->priority, resulting in overshoots of
      shrinker objects.
      
      To reproduce the problem, use TrueNAS SCALE 24.04.0 with 32GB of DRAM, a
      test ZFS pool, and the following commands:
      
        fio --name=mglru.file --numjobs=36 --ioengine=io_uring \
            --directory=/root/test-zfs-pool/ --size=1024m --buffered=1 \
            --rw=randread --random_distribution=random \
            --time_based --runtime=1h &
      
        for ((i = 0; i < 20; i++))
        do
          sleep 120
          fio --name=mglru.anon --numjobs=16 --ioengine=mmap \
            --filename=/dev/zero --size=1024m --fadvise_hint=0 \
            --rw=randrw --random_distribution=random \
            --time_based --runtime=1m
        done
      
      To fix the problem:
      1. Cap scan_control->priority at or above DEF_PRIORITY/2 (see the sketch
         below), to prevent the jump-start from being overly aggressive.
      2. Account for the progress from mm_account_reclaimed_pages(), to
         prevent kswapd_shrink_node() from raising the priority
         unnecessarily.
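
      The cap in step 1 can be expressed as a clamp of the estimated priority
      (a sketch of the intent, not the literal hunk; DEF_PRIORITY and clamp()
      are existing kernel symbols):

      ```
        /* do not let the estimate pick a priority more aggressive (lower)
         * than DEF_PRIORITY/2, however large the cold/hot estimate is */
        sc->priority = clamp(priority, DEF_PRIORITY / 2, DEF_PRIORITY);
      ```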
      
      Link: https://lkml.kernel.org/r/20240711191957.939105-2-yuzhao@google.com
      
      
      Fixes: e4dde56c ("mm: multi-gen LRU: per-node lru_gen_folio lists")
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Reported-by: Alexander Motin <mav@ixsystems.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3f74e6bd
    • mm/mglru: fix div-by-zero in vmpressure_calc_level() · 8b671fe1
      Yu Zhao authored
      evict_folios() uses a second pass to reclaim folios that have gone through
      page writeback and become clean before it finishes the first pass, since
      folio_rotate_reclaimable() cannot handle those folios due to the
      isolation.
      
      The second pass tries to avoid potential double counting by deducting
      scan_control->nr_scanned.  However, this can result in underflow of
      nr_scanned, under a condition where shrink_folio_list() does not increment
      nr_scanned, i.e., when folio_trylock() fails.
      
      The underflow can cause the divisor, i.e., scale=scanned+reclaimed in
      vmpressure_calc_level(), to become zero, resulting in the following crash:
      
        [exception RIP: vmpressure_work_fn+101]
        process_one_work at ffffffffa3313f2b
      
      Since scan_control->nr_scanned has no established semantics, the potential
      double counting has minimal risks.  Therefore, fix the problem by not
      deducting scan_control->nr_scanned in evict_folios().
      
      Link: https://lkml.kernel.org/r/20240711191957.939105-1-yuzhao@google.com
      
      
      Fixes: 359a5e14 ("mm: multi-gen LRU: retry folios written back while isolated")
      Reported-by: Wei Xu <weixugc@google.com>
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Cc: Alexander Motin <mav@ixsystems.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8b671fe1
    • mm/kmemleak: replace strncpy() with strscpy() · 0b847801
      Kees Cook authored
      Replace the deprecated[1] strncpy() calls with strscpy().  Uses of
      object->comm do not depend on the padding side-effect.
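
      The shape of the conversion (a sketch; kmemleak's object->comm is a
      fixed-size buffer, so the bounded, NUL-terminating copy is what matters
      here):

      ```
        /* before: may leave the destination unterminated on truncation */
        strncpy(object->comm, current->comm, sizeof(object->comm));

        /* after: always NUL-terminates, and does not pad with trailing NULs */
        strscpy(object->comm, current->comm, sizeof(object->comm));
      ```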
      
      Link: https://github.com/KSPP/linux/issues/90 [1]
      Link: https://lkml.kernel.org/r/20240710001300.work.004-kees@kernel.org
      
      
      Signed-off-by: Kees Cook <kees@kernel.org>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0b847801
    • mm, page_alloc: put should_fail_alloc_page() back behing CONFIG_FAIL_PAGE_ALLOC · 53dabce2
      Vlastimil Babka authored
      This mostly reverts commit af3b8544 ("mm/page_alloc.c: allow error
      injection").  The commit made should_fail_alloc_page() a noinline function
      that's always called from the page allocation hotpath, even if it's empty
      because CONFIG_FAIL_PAGE_ALLOC is not enabled, and there is no option to
      disable it and prevent the associated function call overhead.
      
      As with the preceding patch "mm, slab: put should_failslab back behind
      CONFIG_SHOULD_FAILSLAB" and for the same reasons, put the
      should_fail_alloc_page() back behind the config option.  When enabled, the
      ALLOW_ERROR_INJECTION and BTF_ID records are preserved so it's not a
      complete revert.
      
      Link: https://lkml.kernel.org/r/20240711-b4-fault-injection-reverts-v1-2-9e2651945d68@suse.cz
      
      
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andrii Nakryiko <andrii@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Eduard Zingerman <eddyz87@gmail.com>
      Cc: Hao Luo <haoluo@google.com>
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: KP Singh <kpsingh@kernel.org>
      Cc: Martin KaFai Lau <martin.lau@linux.dev>
      Cc: Mateusz Guzik <mjguzik@gmail.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Song Liu <song@kernel.org>
      Cc: Stanislav Fomichev <sdf@fomichev.me>
      Cc: Yonghong Song <yonghong.song@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      53dabce2
    • mm, slab: put should_failslab() back behind CONFIG_SHOULD_FAILSLAB · a7526fe8
      Vlastimil Babka authored
      Patch series "revert unconditional slab and page allocator fault injection
      calls".
      
      These two patches largely revert commits that added function call overhead
      into slab and page allocation hotpaths and that cannot be currently
      disabled even though related CONFIG_ options do exist.
      
      A much more involved solution that can keep the callsites always existing
      but hidden behind a static key if unused, is possible [1] and can be
      pursued by anyone who believes it's necessary.  Meanwhile, the fact that
      the should_failslab() error injection is already not functional on
      kernels built with current gcc without anyone noticing [2], and the
      lukewarm response to [1], suggest the need is not there.  I believe it
      is fairer to have the state after this series as a baseline for possible
      further optimisation, instead of the unconditional overhead.
      
      For example a possible compromise for anyone who's fine with an empty
      function call overhead but not the full CONFIG_FAILSLAB /
      CONFIG_FAIL_PAGE_ALLOC overhead is to reuse patch 1 from [1] but insert a
      static key check only inside should_failslab() and
      should_fail_alloc_page() before performing the more expensive checks.
      
      [1] https://lore.kernel.org/all/20240620-fault-injection-statickeys-v2-0-e23947d3d84b@suse.cz/#t
      [2] https://github.com/bpftrace/bpftrace/issues/3258
      
      
      This patch (of 2):
      
      This mostly reverts commit 4f6923fb ("mm: make should_failslab always
      available for fault injection").  The commit made should_failslab() a
      noinline function that's always called from the slab allocation hotpath,
      even if it's empty because CONFIG_SHOULD_FAILSLAB is not enabled, and
      there is no option to disable that call.  This is visible in profiles and
      the function call overhead can be noticeable especially with cpu
      mitigations.
      
      Meanwhile the bpftrace program example in the commit silently does not
      work without CONFIG_SHOULD_FAILSLAB anyway with a recent gcc, because the
      empty function gets a .constprop clone that is actually being called
      (uselessly) from the slab hotpath, while the error injection is hooked to
      the original function that's not being called at all [1].
      
      Thus put the whole should_failslab() function back behind
      CONFIG_SHOULD_FAILSLAB.  It's not a complete revert of 4f6923fb - the
      int return type that returns -ENOMEM on failure is preserved, as well as
      the ALLOW_ERROR_INJECTION annotation.  The BTF_ID() record that was meanwhile
      added is also guarded by CONFIG_SHOULD_FAILSLAB.
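
      The resulting shape is the usual config-guarded stub pattern, roughly (a
      sketch of the pattern, not the exact hunk):

      ```
        #ifdef CONFIG_SHOULD_FAILSLAB
        int should_failslab(struct kmem_cache *s, gfp_t gfpflags);
        #else
        static inline int should_failslab(struct kmem_cache *s, gfp_t gfpflags)
        {
                /* compiles away entirely when the config option is off */
                return 0;
        }
        #endif
      ```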
      
      [1] https://github.com/bpftrace/bpftrace/issues/3258
      
      Link: https://lkml.kernel.org/r/20240711-b4-fault-injection-reverts-v1-0-9e2651945d68@suse.cz
      Link: https://lkml.kernel.org/r/20240711-b4-fault-injection-reverts-v1-1-9e2651945d68@suse.cz
      
      
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andrii Nakryiko <andrii@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Eduard Zingerman <eddyz87@gmail.com>
      Cc: Hao Luo <haoluo@google.com>
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: KP Singh <kpsingh@kernel.org>
      Cc: Martin KaFai Lau <martin.lau@linux.dev>
      Cc: Mateusz Guzik <mjguzik@gmail.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Song Liu <song@kernel.org>
      Cc: Stanislav Fomichev <sdf@fomichev.me>
      Cc: Yonghong Song <yonghong.song@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a7526fe8
    • mm: ignore data-race in __swap_writepage · 7b7aca6d
      Pei Li authored
      Syzbot reported a possible data race:
      
      BUG: KCSAN: data-race in __swap_writepage / scan_swap_map_slots
      
      read-write to 0xffff888102fca610 of 8 bytes by task 7106 on cpu 1.
      read to 0xffff888102fca610 of 8 bytes by task 7080 on cpu 0.
      
      While we are in __swap_writepage to read sis->flags, scan_swap_map_slots
      is trying to update it with SWP_SCANNING.
      
      value changed: 0x0000000000008083 -> 0x0000000000004083.
      
      While this can be updated non-atomically, this won't affect
      SWP_SYNCHRONOUS_IO, so we consider this data-race safe.
      
      This was possibly introduced by commit 3222d8c2 ("block: remove
      ->rw_page"), which introduced this if branch.
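
      One conventional way to document such a benign race is KCSAN's
      data_race() annotation; a sketch of the idea (not necessarily the exact
      hunk applied):

      ```
        /* sis->flags may be concurrently updated (e.g. SWP_SCANNING), but the
         * SWP_SYNCHRONOUS_IO bit tested here is stable for our purposes. */
        if (data_race(sis->flags & SWP_SYNCHRONOUS_IO)) {
                /* take the synchronous submission path */
        }
      ```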
      
      Link: https://lkml.kernel.org/r/20240711-bug13-v1-1-cea2b8ae8d76@gmail.com
      
      
      Fixes: 3222d8c2 ("block: remove ->rw_page")
      Signed-off-by: Pei Li <peili.dev@gmail.com>
      Reported-by: <syzbot+da25887cc13da6bf3b8c@syzkaller.appspotmail.com>
      Closes: https://syzkaller.appspot.com/bug?extid=da25887cc13da6bf3b8c
      
      
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Shuah Khan <skhan@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7b7aca6d
  4. Jul 15, 2024
    • mm/memcg: alignment memcg_data define condition · a52c6330
      Alex Shi (Tencent) authored
      
      Commit 21c690a3 ("mm: introduce slabobj_ext to support slab object
      extensions") changed the folio/page->memcg_data define condition from
      MEMCG to SLAB_OBJ_EXT.  This leaves memcg_data exposed even when !MEMCG.
      
      As Vlastimil Babka suggested, add _unused_slab_obj_exts to SLAB_MATCH
      for slab.obj_exts when !MEMCG.  That resolves the match issue and cleans
      up the feature logic.
      
      Signed-off-by: Alex Shi (Tencent) <alexs@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Yoann Congal <yoann.congal@smile.fr>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      a52c6330