  1. Jun 09, 2023
• writeback: move wb_over_bg_thresh() call outside lock section · 2816ea2a
      Yosry Ahmed authored
      Patch series "cgroup: eliminate atomic rstat flushing", v5.
      
      A previous patch series [1] changed most atomic rstat flushing contexts to
      become non-atomic.  This was done to avoid an expensive operation that
      scales with # cgroups and # cpus to happen with irqs disabled and
      scheduling not permitted.  There were two remaining atomic flushing
      contexts after that series.  This series tries to eliminate them as well,
      eliminating atomic rstat flushing completely.
      
      The two remaining atomic flushing contexts are:
      (a) wb_over_bg_thresh()->mem_cgroup_wb_stats()
      (b) mem_cgroup_threshold()->mem_cgroup_usage()
      
      For (a), flushing needs to be atomic as wb_writeback() calls
      wb_over_bg_thresh() with a spinlock held.  However, it seems like the call
      to wb_over_bg_thresh() doesn't need to be protected by that spinlock, so
      this series proposes a refactoring that moves the call outside the lock
critical section and makes the stats flushing in mem_cgroup_wb_stats()
      non-atomic.
      
      For (b), flushing needs to be atomic as mem_cgroup_threshold() is called
      with irqs disabled.  We only flush the stats when calculating the root
      usage, as it is approximated as the sum of some memcg stats (file, anon,
      and optionally swap) instead of the conventional page counter.  This
      series proposes changing this calculation to use the global stats instead,
      eliminating the need for a memcg stat flush.
      
      After these 2 contexts are eliminated, we no longer need
      mem_cgroup_flush_stats_atomic() or cgroup_rstat_flush_atomic().  We can
      remove them and simplify the code.
      
      [1] https://lore.kernel.org/linux-mm/20230330191801.1967435-1-yosryahmed@google.com/
      
      
      This patch (of 5):
      
      wb_over_bg_thresh() calls mem_cgroup_wb_stats() which invokes an rstat
      flush, which can be expensive on large systems. Currently,
      wb_writeback() calls wb_over_bg_thresh() within a lock section, so we
      have to do the rstat flush atomically. On systems with a lot of
      cpus and/or cgroups, this can cause us to disable irqs for a long time,
      potentially causing problems.
      
      Move the call to wb_over_bg_thresh() outside the lock section in
      preparation to make the rstat flush in mem_cgroup_wb_stats() non-atomic.
The list_empty(&wb->work_list) check should be okay outside the lock
section of wb->list_lock as it is protected by a separate lock
(wb->work_lock), and wb_over_bg_thresh() doesn't seem to modify any of
the wb->b_* lists that wb->list_lock is protecting.
Also, the loop seems to be already releasing and reacquiring the
lock, so this refactoring looks safe.
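The before/after shape of the refactor can be sketched in userspace C, with a pthread mutex standing in for wb->list_lock and a stub standing in for the rstat flush (all names here are illustrative, not the actual fs-writeback.c code):

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static bool lock_held;            /* tracks the critical section */
static bool flushed_under_lock;   /* set if the expensive call ran locked */

/* stand-in for the expensive rstat flush in mem_cgroup_wb_stats() */
static bool over_bg_thresh(void)
{
    if (lock_held)
        flushed_under_lock = true; /* the situation the patch avoids */
    return false;
}

/* after the refactor: evaluate the threshold before taking the lock */
bool writeback_pass(void)
{
    bool over = over_bg_thresh(); /* the flush may now sleep safely */

    pthread_mutex_lock(&list_lock);
    lock_held = true;
    /* ... walk the wb->b_* lists under the lock ... */
    lock_held = false;
    pthread_mutex_unlock(&list_lock);
    return over;
}
```

The point of the sketch is only the ordering: the expensive evaluation happens entirely outside the critical section, so it no longer has to be atomic.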
      
      Link: https://lkml.kernel.org/r/20230421174020.2994750-1-yosryahmed@google.com
      Link: https://lkml.kernel.org/r/20230421174020.2994750-2-yosryahmed@google.com
      
      
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2816ea2a
  2. May 24, 2023
  3. May 23, 2023
  4. May 19, 2023
• NFSv4.2: Fix a potential double free with READ_PLUS · 43439d85
      Anna Schumaker authored
      
      kfree()-ing the scratch page isn't enough, we also need to set the pointer
      back to NULL to avoid a double-free in the case of a resend.
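The fix follows the usual pattern: since kfree(NULL), like free(NULL), is a no-op, clearing the pointer makes a repeated release harmless. A userspace sketch with a hypothetical stand-in struct:

```c
#include <stdlib.h>

/* hypothetical stand-in for the structure owning the scratch page */
struct scratch_owner {
    void *scratch;
};

/* free the buffer AND clear the pointer, so a second call on a resend
 * becomes a no-op instead of a double-free */
void release_scratch(struct scratch_owner *owner)
{
    free(owner->scratch);
    owner->scratch = NULL;
}
```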
      
      Fixes: fbd2a05f (NFSv4.2: Rework scratch handling for READ_PLUS)
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
      43439d85
• NFS: Convert kmap_atomic() to kmap_local_folio() · 4b71e241
      Fabio M. De Francesco authored
      
      kmap_atomic() is deprecated in favor of kmap_local_{folio,page}().
      
      Therefore, replace kmap_atomic() with kmap_local_folio() in
      nfs_readdir_folio_array_append().
      
kmap_atomic() disables page-faults and preemption (the latter only for
!PREEMPT_RT kernels); however, the code within the mapping/un-mapping in
      nfs_readdir_folio_array_append() does not depend on the above-mentioned
      side effects.
      
      Therefore, a mere replacement of the old API with the new one is all that
      is required (i.e., there is no need to explicitly add any calls to
      pagefault_disable() and/or preempt_disable()).
      
      Tested with (x)fstests in a QEMU/KVM x86_32 VM, 6GB RAM, booting a kernel
      with HIGHMEM64GB enabled.
      
      Cc: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
      Fixes: ec108d3c ("NFS: Convert readdir page array functions to use a folio")
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
      4b71e241
  5. May 18, 2023
  6. May 17, 2023
• nilfs2: fix use-after-free bug of nilfs_root in nilfs_evict_inode() · 9b5a04ac
      Ryusuke Konishi authored
      During unmount process of nilfs2, nothing holds nilfs_root structure after
      nilfs2 detaches its writer in nilfs_detach_log_writer().  However, since
      nilfs_evict_inode() uses nilfs_root for some cleanup operations, it may
      cause use-after-free read if inodes are left in "garbage_list" and
      released by nilfs_dispose_list() at the end of nilfs_detach_log_writer().
      
      Fix this issue by modifying nilfs_evict_inode() to only clear inode
      without additional metadata changes that use nilfs_root if the file system
      is degraded to read-only or the writer is detached.
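The shape of the fix is a guard at the top of the evict path: when the filesystem is read-only/degraded or the writer is detached, only clear the inode and skip every step that would dereference nilfs_root. A minimal userspace sketch (all types and names below are hypothetical stand-ins):

```c
#include <stdbool.h>
#include <stddef.h>

struct root_sketch {            /* stand-in for nilfs_root */
    int dirty_count;
};

struct inode_sketch {
    struct root_sketch *root;   /* may already be unreachable on unmount */
    bool read_only;             /* degraded / writer-detached condition */
};

/* clear-only path when the root must not be touched */
void evict_inode_sketch(struct inode_sketch *inode)
{
    if (inode->read_only || inode->root == NULL)
        return;                 /* no metadata changes via the root */
    inode->root->dirty_count--; /* normal path may use the root */
}
```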
      
      Link: https://lkml.kernel.org/r/20230509152956.8313-1-konishi.ryusuke@gmail.com
      
      
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: <syzbot+78d4495558999f55d1da@syzkaller.appspotmail.com>
      Closes: https://lkml.kernel.org/r/00000000000099e5ac05fb1c3b85@google.com
      
      
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      9b5a04ac
• SMB3: drop reference to cfile before sending oplock break · 59a556ae
      Bharath SM authored
      
In the cifs_oplock_break function we drop the reference to a cfile at
the end of the function, due to which the close command goes on the wire
after the lease break acknowledgment, even if the file is already closed
by the application but we had deferred the handle close.
If another client with limited file share access waiting on the lease
break ack proceeds with an operation on that file as soon as the first
client sends the ack, then we may encounter a sharing violation error
because of the open handle.
The solution is to put the reference to the cfile (sending a close on
the wire if it is the last reference) and then send the oplock break
acknowledgment to the server.
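The ordering the patch establishes can be sketched as two wire operations whose sequence is recorded (the function names below are illustrative stand-ins, not the cifs.ko code):

```c
/* record the order in which the two operations hit the wire */
static char order[4];
static int nops;

static void send_close(void)            { order[nops++] = 'C'; }
static void send_lease_break_ack(void)  { order[nops++] = 'A'; }

/* fixed ordering: drop the cfile reference first (which sends the
 * deferred close if it was the last reference), then acknowledge */
void oplock_break_sketch(void)
{
    send_close();
    send_lease_break_ack();
}
```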
      
      Fixes: 9e31678f ("SMB3: fix lease break timeout when multiple deferred close handles for the same file.")
      Cc: stable@kernel.org
Signed-off-by: Bharath SM <bharathsm@microsoft.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
      59a556ae
• SMB3: Close all deferred handles of inode in case of handle lease break · 47592fa8
      Bharath SM authored
      
Oplock break may occur for a different file handle than the deferred
handle. Check the inode's deferred closes list; if it is not empty, then
close all the deferred handles of the inode, because we should not cache
handles if we don't have a handle lease.
      
E.g.: If the openfile list has one deferred file handle and another open
file handle from the app for the same file, then on a lease break we
choose the first handle in the openfile list. The first handle in the
list can be the deferred handle or the actual open file handle from the
app. If it is the actual open handle, then today we don't close the
deferred handles when we lose the handle lease on a file. The problem
with this is that if the app later decides to close the existing open
handle, we would still be caching the deferred handles until the
deferred close timeout. Leaving a handle open may result in a sharing
violation when a Windows client tries to open the file with limited file
share access.

So we should check the inode's deferred list, walk through the list of
deferred files, and close all of them.
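The walk described above is a plain list traversal; a minimal sketch with a hypothetical singly linked deferred-close list:

```c
#include <stdbool.h>
#include <stddef.h>

/* hypothetical node in the inode's deferred-close list */
struct deferred {
    struct deferred *next;
    bool closed;
};

/* on a handle lease break, close every deferred handle of the inode,
 * not just the handle the break arrived on */
int close_all_deferred(struct deferred *head)
{
    int closed = 0;

    for (struct deferred *d = head; d != NULL; d = d->next) {
        if (!d->closed) {
            d->closed = true;
            closed++;
        }
    }
    return closed;
}
```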
      
      Fixes: 9e31678f ("SMB3: fix lease break timeout when multiple deferred close handles for the same file.")
      Cc: stable@kernel.org
Signed-off-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
      47592fa8
• fs: don't call posix_acl_listxattr in generic_listxattr · 3a7bb21b
      Jeff Layton authored
      
      Commit f2620f16 caused the kernel to start emitting POSIX ACL xattrs
      for NFSv4 inodes, which it doesn't support. The only other user of
      generic_listxattr is HFS (classic) and it doesn't support POSIX ACLs
      either.
      
Fixes: f2620f16 ("xattr: simplify listxattr helpers")
Reported-by: Ondrej Valousek <ondrej.valousek.xm@renesas.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
      Message-Id: <20230516124655.82283-1-jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
• statfs: enforce statfs[64] structure initialization · ed40866e
      Ilya Leoshkevich authored
      
      s390's struct statfs and struct statfs64 contain padding, which
      field-by-field copying does not set. Initialize the respective structs
      with zeros before filling them and copying them to userspace, like it's
      already done for the compat versions of these structs.
      
      Found by KMSAN.
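The pattern is the standard one for structs copied to userspace: zero the whole object before field-by-field assignment so padding and spare fields never carry stale data. A userspace sketch with a simplified stand-in struct (field values are illustrative):

```c
#include <stdint.h>
#include <string.h>

/* simplified stand-in with explicit spare bytes, like s390's statfs */
struct statfs_sketch {
    uint32_t f_type;
    uint32_t f_bsize;
    uint8_t  f_spare[8];
};

/* zero first, then fill: untouched bytes are guaranteed to be 0 */
void fill_statfs(struct statfs_sketch *st)
{
    memset(st, 0, sizeof(*st));
    st->f_type  = 0x12345678;  /* illustrative magic */
    st->f_bsize = 4096;
}
```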
      
      [agordeev@linux.ibm.com: fixed typo in patch description]
Acked-by: Heiko Carstens <hca@linux.ibm.com>
      Cc: stable@vger.kernel.org # v4.14+
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Link: https://lore.kernel.org/r/20230504144021.808932-2-iii@linux.ibm.com
      
      
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
      ed40866e
• btrfs: use nofs when cleaning up aborted transactions · 597441b3
      Josef Bacik authored
      
      Our CI system caught a lockdep splat:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        6.3.0-rc7+ #1167 Not tainted
        ------------------------------------------------------
        kswapd0/46 is trying to acquire lock:
        ffff8c6543abd650 (sb_internal#2){++++}-{0:0}, at: btrfs_commit_inode_delayed_inode+0x5f/0x120
      
        but task is already holding lock:
        ffffffffabe61b40 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x4aa/0x7a0
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #1 (fs_reclaim){+.+.}-{0:0}:
      	 fs_reclaim_acquire+0xa5/0xe0
      	 kmem_cache_alloc+0x31/0x2c0
      	 alloc_extent_state+0x1d/0xd0
      	 __clear_extent_bit+0x2e0/0x4f0
      	 try_release_extent_mapping+0x216/0x280
      	 btrfs_release_folio+0x2e/0x90
      	 invalidate_inode_pages2_range+0x397/0x470
      	 btrfs_cleanup_dirty_bgs+0x9e/0x210
      	 btrfs_cleanup_one_transaction+0x22/0x760
      	 btrfs_commit_transaction+0x3b7/0x13a0
      	 create_subvol+0x59b/0x970
      	 btrfs_mksubvol+0x435/0x4f0
      	 __btrfs_ioctl_snap_create+0x11e/0x1b0
      	 btrfs_ioctl_snap_create_v2+0xbf/0x140
      	 btrfs_ioctl+0xa45/0x28f0
      	 __x64_sys_ioctl+0x88/0xc0
      	 do_syscall_64+0x38/0x90
      	 entry_SYSCALL_64_after_hwframe+0x72/0xdc
      
        -> #0 (sb_internal#2){++++}-{0:0}:
      	 __lock_acquire+0x1435/0x21a0
      	 lock_acquire+0xc2/0x2b0
      	 start_transaction+0x401/0x730
      	 btrfs_commit_inode_delayed_inode+0x5f/0x120
      	 btrfs_evict_inode+0x292/0x3d0
      	 evict+0xcc/0x1d0
      	 inode_lru_isolate+0x14d/0x1e0
      	 __list_lru_walk_one+0xbe/0x1c0
      	 list_lru_walk_one+0x58/0x80
      	 prune_icache_sb+0x39/0x60
      	 super_cache_scan+0x161/0x1f0
      	 do_shrink_slab+0x163/0x340
      	 shrink_slab+0x1d3/0x290
      	 shrink_node+0x300/0x720
      	 balance_pgdat+0x35c/0x7a0
      	 kswapd+0x205/0x410
      	 kthread+0xf0/0x120
      	 ret_from_fork+0x29/0x50
      
        other info that might help us debug this:
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(fs_reclaim);
      				 lock(sb_internal#2);
      				 lock(fs_reclaim);
          lock(sb_internal#2);
      
         *** DEADLOCK ***
      
        3 locks held by kswapd0/46:
         #0: ffffffffabe61b40 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x4aa/0x7a0
         #1: ffffffffabe50270 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x113/0x290
         #2: ffff8c6543abd0e0 (&type->s_umount_key#44){++++}-{3:3}, at: super_cache_scan+0x38/0x1f0
      
        stack backtrace:
        CPU: 0 PID: 46 Comm: kswapd0 Not tainted 6.3.0-rc7+ #1167
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        Call Trace:
         <TASK>
         dump_stack_lvl+0x58/0x90
         check_noncircular+0xd6/0x100
         ? save_trace+0x3f/0x310
         ? add_lock_to_list+0x97/0x120
         __lock_acquire+0x1435/0x21a0
         lock_acquire+0xc2/0x2b0
         ? btrfs_commit_inode_delayed_inode+0x5f/0x120
         start_transaction+0x401/0x730
         ? btrfs_commit_inode_delayed_inode+0x5f/0x120
         btrfs_commit_inode_delayed_inode+0x5f/0x120
         btrfs_evict_inode+0x292/0x3d0
         ? lock_release+0x134/0x270
         ? __pfx_wake_bit_function+0x10/0x10
         evict+0xcc/0x1d0
         inode_lru_isolate+0x14d/0x1e0
         __list_lru_walk_one+0xbe/0x1c0
         ? __pfx_inode_lru_isolate+0x10/0x10
         ? __pfx_inode_lru_isolate+0x10/0x10
         list_lru_walk_one+0x58/0x80
         prune_icache_sb+0x39/0x60
         super_cache_scan+0x161/0x1f0
         do_shrink_slab+0x163/0x340
         shrink_slab+0x1d3/0x290
         shrink_node+0x300/0x720
         balance_pgdat+0x35c/0x7a0
         kswapd+0x205/0x410
         ? __pfx_autoremove_wake_function+0x10/0x10
         ? __pfx_kswapd+0x10/0x10
         kthread+0xf0/0x120
         ? __pfx_kthread+0x10/0x10
         ret_from_fork+0x29/0x50
         </TASK>
      
      This happens because when we abort the transaction in the transaction
      commit path we call invalidate_inode_pages2_range on our block group
      cache inodes (if we have space cache v1) and any delalloc inodes we may
      have.  The plain invalidate_inode_pages2_range() call passes through
      GFP_KERNEL, which makes sense in most cases, but not here.  Wrap these
      two invalidate callees with memalloc_nofs_save/memalloc_nofs_restore to
      make sure we don't end up with the fs reclaim dependency under the
      transaction dependency.
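memalloc_nofs_save()/memalloc_nofs_restore() work by setting a flag in the task that the allocator checks before recursing into filesystem reclaim. A tiny userspace emulation of that scoping pattern (the real API lives in linux/sched/mm.h; everything below, including the flag value, is a stand-in):

```c
#include <stdbool.h>

#define PF_MEMALLOC_NOFS 0x1u      /* illustrative flag value */

static unsigned int task_flags;    /* stand-in for current->flags */

unsigned int nofs_save(void)
{
    unsigned int old = task_flags;

    task_flags |= PF_MEMALLOC_NOFS;
    return old;
}

void nofs_restore(unsigned int old)
{
    task_flags = old;
}

/* what the allocator would consult before entering fs reclaim */
bool may_enter_fs_reclaim(void)
{
    return !(task_flags & PF_MEMALLOC_NOFS);
}
```

Wrapping the two invalidate calls between save and restore means any GFP_KERNEL allocation inside them behaves as if it were GFP_NOFS.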
      
      CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
      597441b3
• btrfs: handle memory allocation failure in btrfs_csum_one_bio · 806570c0
      Johannes Thumshirn authored
      
      Since f8a53bb5 ("btrfs: handle checksum generation in the storage
      layer") the failures of btrfs_csum_one_bio() are handled via
      bio_end_io().
      
This means we can return BLK_STS_RESOURCE from btrfs_csum_one_bio() in
      case the allocation of the ordered sums fails.
      
      This also fixes a syzkaller report, where injecting a failure into the
      kvzalloc() call results in a BUG_ON().
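The change replaces a BUG_ON() on allocation failure with an error return that the bio completion path can handle. A userspace sketch with an injectable allocator (status names and values are illustrative stand-ins, not the kernel's blk_status_t encoding):

```c
#include <stdlib.h>

typedef int blk_status_sketch;
#define STS_OK        0
#define STS_RESOURCE  1   /* stand-in for BLK_STS_RESOURCE */

/* allocator that always fails, for exercising the error path */
static void *fail_alloc(size_t n) { (void)n; return NULL; }

/* propagate the failure instead of BUG_ON(): the caller's end_io
 * handling completes the bio with the error */
blk_status_sketch csum_one_bio_sketch(void *(*alloc)(size_t), size_t n)
{
    void *sums = alloc(n);

    if (sums == NULL)
        return STS_RESOURCE;
    /* ... compute checksums into sums ... */
    free(sums);
    return STS_OK;
}
```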
      
Reported-by: <syzbot+d8941552e21eac774778@syzkaller.appspotmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
      806570c0
• btrfs: scrub: try harder to mark RAID56 block groups read-only · 7561551e
      Qu Wenruo authored
      
      Currently we allow a block group not to be marked read-only for scrub.
      
But for RAID56 block groups, if we require the block group to be
read-only, then we're allowed to use cached content from the scrub
stripe to reduce unnecessary RAID56 reads.
      
      So this patch would:
      
      - Make btrfs_inc_block_group_ro() try harder
        During my tests, for cases like btrfs/061 and btrfs/064, we can hit
        ENOSPC from btrfs_inc_block_group_ro() calls during scrub.
      
        The reason is if we only have one single data chunk, and trying to
        scrub it, we won't have any space left for any newer data writes.
      
        But this check should be done by the caller, especially for scrub
        cases we only temporarily mark the chunk read-only.
        And newer data writes would always try to allocate a new data chunk
        when needed.
      
      - Return error for scrub if we failed to mark a RAID56 chunk read-only
      
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
      7561551e
  7. May 16, 2023
• ksmbd: smb2: Allow messages padded to 8byte boundary · e7b8b8ed
      Gustav Johansson authored
      
The clc length is now accepted when it is <= 8 bytes less than the
length, rather than < 8 bytes less.

This solves issues with some of Axis's SMB clients, which send
messages where the clc length is 8 bytes less than the length.
      
      The specific client was running kernel 4.19.217 with
      smb dialect 3.0.2 on armv7l.
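The relaxed check can be expressed as a small predicate (a sketch of the rule only; the names below are not ksmbd's):

```c
#include <stdbool.h>
#include <stddef.h>

/* accept a structure-size (clc) length up to 8 bytes less than the
 * message length, tolerating clients that pad to an 8-byte boundary */
bool pdu_size_valid(size_t len, size_t clc_len)
{
    if (clc_len > len)
        return false;
    return len - clc_len <= 8;   /* previously required: < 8 */
}
```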
      
      Cc: stable@vger.kernel.org
Signed-off-by: Gustav Johansson <gustajo@axis.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
      e7b8b8ed
• ksmbd: allocate one more byte for implied bcc[0] · 443d61d1
      Chih-Yen Chang authored
      
ksmbd_smb2_check_message() allows the client to return one byte more, so
we need to allocate additional memory in ksmbd_conn_handler_loop() to
avoid out-of-bounds access.
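The over-allocation itself is a one-liner; a sketch of the idea (names hypothetical):

```c
#include <stdlib.h>

/* the message check may read one byte past the declared request size
 * (the implied bcc[0]), so allocate pdu_size + 1 to keep that read in
 * bounds; calloc also zeroes the extra byte */
void *alloc_request_buf(size_t pdu_size)
{
    return calloc(1, pdu_size + 1);
}
```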
      
      Cc: stable@vger.kernel.org
Signed-off-by: Chih-Yen Chang <cc85nod@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
      443d61d1
• ksmbd: fix wrong UserName check in session_user · f0a96d1a
      Chih-Yen Chang authored
      
The offset of UserName is relative to the address of the security
buffer. To ensure the validity of UserName, we need to compare name_off
+ name_len with secbuf_len instead of auth_msg_len.
      
      [   27.096243] ==================================================================
      [   27.096890] BUG: KASAN: slab-out-of-bounds in smb_strndup_from_utf16+0x188/0x350
      [   27.097609] Read of size 2 at addr ffff888005e3b542 by task kworker/0:0/7
      ...
      [   27.099950] Call Trace:
      [   27.100194]  <TASK>
      [   27.100397]  dump_stack_lvl+0x33/0x50
      [   27.100752]  print_report+0xcc/0x620
      [   27.102305]  kasan_report+0xae/0xe0
      [   27.103072]  kasan_check_range+0x35/0x1b0
      [   27.103757]  smb_strndup_from_utf16+0x188/0x350
      [   27.105474]  smb2_sess_setup+0xaf8/0x19c0
      [   27.107935]  handle_ksmbd_work+0x274/0x810
      [   27.108315]  process_one_work+0x419/0x760
      [   27.108689]  worker_thread+0x2a2/0x6f0
      [   27.109385]  kthread+0x160/0x190
      [   27.110129]  ret_from_fork+0x1f/0x30
      [   27.110454]  </TASK>
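The corrected bounds check amounts to validating the offset/length pair against the security buffer rather than the whole message; a minimal predicate sketch (names hypothetical), written to avoid overflow in the addition:

```c
#include <stdbool.h>
#include <stddef.h>

/* UserName's offset is relative to the security buffer, so both the
 * offset and the length must fit inside secbuf_len */
bool username_in_bounds(size_t name_off, size_t name_len, size_t secbuf_len)
{
    if (name_off > secbuf_len)
        return false;
    return name_len <= secbuf_len - name_off;
}
```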
      
      Cc: stable@vger.kernel.org
Signed-off-by: Chih-Yen Chang <cc85nod@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
      f0a96d1a
• ksmbd: fix global-out-of-bounds in smb2_find_context_vals · 02f76c40
      Chih-Yen Chang authored
      
Add a tag_len argument to smb2_find_context_vals() to avoid an
out-of-bounds read when the create_context's name_len is larger than the
tag length.
      
      [    7.995411] ==================================================================
      [    7.995866] BUG: KASAN: global-out-of-bounds in memcmp+0x83/0xa0
      [    7.996248] Read of size 8 at addr ffffffff8258d940 by task kworker/0:0/7
      ...
      [    7.998191] Call Trace:
      [    7.998358]  <TASK>
      [    7.998503]  dump_stack_lvl+0x33/0x50
      [    7.998743]  print_report+0xcc/0x620
      [    7.999458]  kasan_report+0xae/0xe0
      [    7.999895]  kasan_check_range+0x35/0x1b0
      [    8.000152]  memcmp+0x83/0xa0
      [    8.000347]  smb2_find_context_vals+0xf7/0x1e0
      [    8.000635]  smb2_open+0x1df2/0x43a0
      [    8.006398]  handle_ksmbd_work+0x274/0x810
      [    8.006666]  process_one_work+0x419/0x760
      [    8.006922]  worker_thread+0x2a2/0x6f0
      [    8.007429]  kthread+0x160/0x190
      [    8.007946]  ret_from_fork+0x1f/0x30
      [    8.008181]  </TASK>
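The fix boils down to checking the caller-supplied tag length before memcmp() touches the tag; a sketch of that bounded comparison (names hypothetical; "DHnQ" is used only as an example tag):

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* compare a create_context name against a known tag without ever
 * reading past the tag: reject mismatched lengths up front */
bool context_name_matches(const char *name, size_t name_len,
                          const char *tag, size_t tag_len)
{
    if (name_len != tag_len)
        return false;
    return memcmp(name, tag, name_len) == 0;
}
```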
      
      Cc: stable@vger.kernel.org
Signed-off-by: Chih-Yen Chang <cc85nod@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
      02f76c40
  8. May 15, 2023
  9. May 13, 2023