1. 22 Apr, 2020 6 commits
  2. 21 Apr, 2020 2 commits
    • coredump: fix null pointer dereference on coredump · db973a72
      Sudip Mukherjee authored
      If the core_pattern is set to "|" and any process segfaults then we get
      a null pointer dereference while trying to coredump. The call stack shows:
      
          RIP: do_coredump+0x628/0x11c0
      
      When the core_pattern has only "|" there is no point in attempting the
      coredump, so we can check for that while formatting the corename and
      exit with an error.
      
      After this change I get:
      
          format_corename failed
          Aborting core
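
      The check described above can be sketched in user space as follows.
      This is only an illustration under assumptions: the helper name
      pipe_pattern_has_command() is hypothetical and this is not the kernel's
      format_corename() code. It simply rejects a pipe pattern that has
      nothing after the "|".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical user-space helper (not kernel code): for a pattern that
 * starts with '|', report whether a helper command follows the pipe.
 * Patterns that are not pipe patterns are out of scope and pass through.
 */
bool pipe_pattern_has_command(const char *pattern)
{
	if (pattern == NULL)
		return false;
	if (pattern[0] != '|')
		return true;            /* plain file-name template */
	/* Pipe mode: something must follow the '|'. */
	return pattern[1] != '\0';
}
```

      With this sketch, a pattern of just "|" is rejected up front, while a
      pattern that names a helper after the pipe is accepted.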
      
      Fixes: 315c6926 ("coredump: split pipe command whitespace before expanding template")
      Reported-by: Matthew Ruffell <matthew.ruffell@canonical.com>
      Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Paul Wise <pabs3@bonedaddy.net>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200416194612.21418-1-sudipm.mukherjee@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vmalloc: fix remap_vmalloc_range() bounds checks · bdebd6a2
      Jann Horn authored
      remap_vmalloc_range() has had various issues with the bounds checks it
      promises to perform ("This function checks that addr is a valid
      vmalloc'ed area, and that it is big enough to cover the vma") over time,
      e.g.:
      
       - not detecting pgoff<<PAGE_SHIFT overflow
      
       - not detecting (pgoff<<PAGE_SHIFT)+usize overflow
      
       - not checking whether addr and addr+(pgoff<<PAGE_SHIFT) are the same
         vmalloc allocation
      
       - comparing a potentially wildly out-of-bounds pointer with the end of
         the vmalloc region
      
      In particular, since commit fc970227 ("bpf: Add mmap() support for
      BPF_MAP_TYPE_ARRAY"), unprivileged users can cause kernel null pointer
      dereferences by calling mmap() on a BPF map with a size that is bigger
      than the distance from the start of the BPF map to the end of the
      address space.
      
      This could theoretically be used as a kernel ASLR bypass, by using
      whether mmap() with a given offset oopses or returns an error code to
      perform a binary search over the possible address range.
      
      To allow remap_vmalloc_range_partial() to verify that addr and
      addr+(pgoff<<PAGE_SHIFT) are in the same vmalloc region, pass the offset
      to remap_vmalloc_range_partial() instead of adding it to the pointer in
      remap_vmalloc_range().
      
      In remap_vmalloc_range_partial(), fix the check against
      get_vm_area_size() by using size comparisons instead of pointer
      comparisons, and add checks for pgoff.
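
      The idea of comparing sizes rather than pointers can be sketched in
      user space as below. This is an illustration only, not the kernel
      patch: range_fits() and SKETCH_PAGE_SHIFT are hypothetical names, and a
      4 KiB page size is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/*
 * Hypothetical helper: does a mapping of `size` bytes that starts `pgoff`
 * pages into an area of `area_size` bytes stay inside that area?
 * Every step compares sizes, so no intermediate value can wrap around.
 */
bool range_fits(size_t area_size, unsigned long pgoff, size_t size)
{
	size_t off;

	/* Reject pgoff values whose byte offset would overflow. */
	if (pgoff > (SIZE_MAX >> SKETCH_PAGE_SHIFT))
		return false;
	off = (size_t)pgoff << SKETCH_PAGE_SHIFT;

	if (off > area_size)
		return false;
	/* off <= area_size here, so the subtraction cannot underflow. */
	return size <= area_size - off;
}
```

      Because the offset is validated before it is ever added to anything, a
      huge pgoff is rejected instead of producing an out-of-bounds pointer.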
      
      Fixes: 83342314 ("[PATCH] mm: introduce remap_vmalloc_range()")
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: stable@vger.kernel.org
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Yonghong Song <yhs@fb.com>
      Cc: Andrii Nakryiko <andriin@fb.com>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: KP Singh <kpsingh@chromium.org>
      Link: http://lkml.kernel.org/r/20200415222312.236431-1-jannh@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 17 Apr, 2020 2 commits
    • SUNRPC: Fix backchannel RPC soft lockups · 6221f1d9
      Chuck Lever authored
      Currently, after the forward channel connection goes away,
      backchannel operations are causing soft lockups on the server
      because call_transmit_status's SOFTCONN logic ignores ENOTCONN.
      Such backchannel Calls are aggressively retried until the client
      reconnects.
      
      Backchannel Calls should use RPC_TASK_NOCONNECT rather than
      RPC_TASK_SOFTCONN. If there is no forward connection, the server is
      not capable of establishing a connection back to the client, thus
      that backchannel request should fail before the server attempts to
      send it. Commit 58255a4e ("NFSD: NFSv4 callback client should
      use RPC_TASK_SOFTCONN") was merged several years before
      RPC_TASK_NOCONNECT was available.
      
      Because setup_callback_client() explicitly sets NOPING, the NFSv4.0
      callback connection depends on the first callback RPC to initiate
      a connection to the client. Thus NFSv4.0 needs to continue to use
      RPC_TASK_SOFTCONN.

      Suggested-by: Trond Myklebust <trondmy@hammerspace.com>
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Cc: <stable@vger.kernel.org> # v4.20+
    • btrfs: fix setting last_trans for reloc roots · aec7db3b
      Josef Bacik authored
      I made a mistake with my previous fix: I assumed that we didn't need to
      mess with the reloc roots once we were out of the part of relocation where
      we are actually moving the extents.
      
      The subtle thing that I missed is that btrfs_init_reloc_root() also
      updates the last_trans for the reloc root when we do
      btrfs_record_root_in_trans() for the corresponding fs_root.  I've added a
      comment to make sure future me doesn't make this mistake again.
      
      This showed up as a WARN_ON() in btrfs_copy_root() because our
      last_trans didn't == the current transid.  This could happen if we
      snapshotted a fs root with a reloc root after we set
      rc->create_reloc_tree = 0, but before we actually merge the reloc root.
      
      Worth mentioning that the regression produced the following warning
      when running snapshot creation and balance in parallel:
      
        BTRFS info (device sdc): relocating block group 30408704 flags metadata|dup
        ------------[ cut here ]------------
        WARNING: CPU: 0 PID: 12823 at fs/btrfs/ctree.c:191 btrfs_copy_root+0x26f/0x430 [btrfs]
        CPU: 0 PID: 12823 Comm: btrfs Tainted: G        W 5.6.0-rc7-btrfs-next-58 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
        RIP: 0010:btrfs_copy_root+0x26f/0x430 [btrfs]
        RSP: 0018:ffffb96e044279b8 EFLAGS: 00010202
        RAX: 0000000000000009 RBX: ffff9da70bf61000 RCX: ffffb96e04427a48
        RDX: ffff9da733a770c8 RSI: ffff9da70bf61000 RDI: ffff9da694163818
        RBP: ffff9da733a770c8 R08: fffffffffffffff8 R09: 0000000000000002
        R10: ffffb96e044279a0 R11: 0000000000000000 R12: ffff9da694163818
        R13: fffffffffffffff8 R14: ffff9da6d2512000 R15: ffff9da714cdac00
        FS:  00007fdeacf328c0(0000) GS:ffff9da735e00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 000055a2a5b8a118 CR3: 00000001eed78002 CR4: 00000000003606f0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         ? create_reloc_root+0x49/0x2b0 [btrfs]
         ? kmem_cache_alloc_trace+0xe5/0x200
         create_reloc_root+0x8b/0x2b0 [btrfs]
         btrfs_reloc_post_snapshot+0x96/0x5b0 [btrfs]
         create_pending_snapshot+0x610/0x1010 [btrfs]
         create_pending_snapshots+0xa8/0xd0 [btrfs]
         btrfs_commit_transaction+0x4c7/0xc50 [btrfs]
         ? btrfs_mksubvol+0x3cd/0x560 [btrfs]
         btrfs_mksubvol+0x455/0x560 [btrfs]
         __btrfs_ioctl_snap_create+0x15f/0x190 [btrfs]
         btrfs_ioctl_snap_create_v2+0xa4/0xf0 [btrfs]
         ? mem_cgroup_commit_charge+0x6e/0x540
         btrfs_ioctl+0x12d8/0x3760 [btrfs]
         ? do_raw_spin_unlock+0x49/0xc0
         ? _raw_spin_unlock+0x29/0x40
         ? __handle_mm_fault+0x11b3/0x14b0
         ? ksys_ioctl+0x92/0xb0
         ksys_ioctl+0x92/0xb0
         ? trace_hardirqs_off_thunk+0x1a/0x1c
         __x64_sys_ioctl+0x16/0x20
         do_syscall_64+0x5c/0x280
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
        RIP: 0033:0x7fdeabd3bdd7
      
      Fixes: 2abc726a ("btrfs: do not init a reloc root if we aren't relocating")
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 16 Apr, 2020 14 commits
    • smb3: remove overly noisy debug line in signing errors · 9692ea9d
      Steve French authored
      A dump_stack call for signature related errors can be too noisy
      and not of much value in debugging such problems.

      Signed-off-by: Steve French <stfrench@microsoft.com>
      Reviewed-by: Shyam Prasad N <nspmangalore@gmail.com>
    • xfs: move inode flush to the sync workqueue · f0f7a674
      Darrick J. Wong authored
      Move the inode dirty data flushing to a workqueue so that multiple
      threads can take advantage of a single thread's flushing work.  The
      ratelimiting technique used in bdd4ee4 was not successful, because
      threads that skipped the inode flush scan due to ratelimiting would
      ENOSPC early, which caused occasional (but noticeable) changes in
      behavior and sporadic fstest regressions.
      
      Therefore, make all the writer threads wait on a single inode flush,
      which eliminates both the stampeding hordes of flushers and the small
      window in which a write could fail with ENOSPC because it lost the
      ratelimit race even after another thread freed space.
      
      Fixes: c6425702 ("xfs: ratelimit inode flush on buffered write ENOSPC")
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
    • proc, time/namespace: Show clock symbolic names in /proc/pid/timens_offsets · 94d440d6
      Andrei Vagin authored
      Michael Kerrisk suggested replacing numeric clock IDs with symbolic names.
      
      Now the content of these files looks like this:

          $ cat /proc/774/timens_offsets
          monotonic      864000         0
          boottime      1728000         0
      
      For setting offsets, both representations of clocks (numeric and symbolic)
      can be used.
      
      As for compatibility, it is acceptable to change things as long as
      userspace doesn't care. The format of timens_offsets files is very new and
      there are no userspace tools yet which rely on this format.
      
      But three projects (crun, util-linux and criu) rely on the interface for
      setting time offsets, and this is why it's required to continue supporting
      the numeric clock IDs on write.
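
      Accepting both representations on write could look roughly like the
      user-space sketch below. It is only an illustration, not the kernel
      parser: parse_clock_token() and the SKETCH_* constants are hypothetical
      names, the numeric values mirror Linux's CLOCK_MONOTONIC (1) and
      CLOCK_BOOTTIME (7), and only the two clocks shown in the sample output
      are handled.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: these mirror Linux's CLOCK_MONOTONIC and CLOCK_BOOTTIME. */
enum { SKETCH_CLOCK_MONOTONIC = 1, SKETCH_CLOCK_BOOTTIME = 7 };

/*
 * Hypothetical helper: map a clock token, given either as a symbolic
 * name or as a decimal clock ID, to its numeric ID. Returns -1 for
 * anything that is not one of the two clocks handled here.
 */
int parse_clock_token(const char *tok)
{
	char *end;
	long id;

	if (strcmp(tok, "monotonic") == 0)
		return SKETCH_CLOCK_MONOTONIC;
	if (strcmp(tok, "boottime") == 0)
		return SKETCH_CLOCK_BOOTTIME;

	/* Fall back to the numeric form for older userspace. */
	id = strtol(tok, &end, 10);
	if (end == tok || *end != '\0')
		return -1;
	if (id == SKETCH_CLOCK_MONOTONIC || id == SKETCH_CLOCK_BOOTTIME)
		return (int)id;
	return -1;
}
```

      Trying the symbolic names first and falling back to the numeric form
      keeps existing writers working while the read side shows names.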
      
      Fixes: 04a8682a ("fs/proc: Introduce /proc/pid/timens_offsets")
      Suggested-by: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: Andrei Vagin <avagin@gmail.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Michael Kerrisk <mtk.manpages@gmail.com>
      Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Dmitry Safonov <0x7f454c46@gmail.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20200411154031.642557-1-avagin@gmail.com
    • proc: Handle umounts cleanly · 4fa3b1c4
      Eric W. Biederman authored
      syzbot writes:
      > KASAN: use-after-free Read in dput (2)
      >
      > proc_fill_super: allocate dentry failed
      > ==================================================================
      > BUG: KASAN: use-after-free in fast_dput fs/dcache.c:727 [inline]
      > BUG: KASAN: use-after-free in dput+0x53e/0xdf0 fs/dcache.c:846
      > Read of size 4 at addr ffff88808a618cf0 by task syz-executor.0/8426
      >
      > CPU: 0 PID: 8426 Comm: syz-executor.0 Not tainted 5.6.0-next-20200412-syzkaller #0
      > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      > Call Trace:
      >  __dump_stack lib/dump_stack.c:77 [inline]
      >  dump_stack+0x188/0x20d lib/dump_stack.c:118
      >  print_address_description.constprop.0.cold+0xd3/0x315 mm/kasan/report.c:382
      >  __kasan_report.cold+0x35/0x4d mm/kasan/report.c:511
      >  kasan_report+0x33/0x50 mm/kasan/common.c:625
      >  fast_dput fs/dcache.c:727 [inline]
      >  dput+0x53e/0xdf0 fs/dcache.c:846
      >  proc_kill_sb+0x73/0xf0 fs/proc/root.c:195
      >  deactivate_locked_super+0x8c/0xf0 fs/super.c:335
      >  vfs_get_super+0x258/0x2d0 fs/super.c:1212
      >  vfs_get_tree+0x89/0x2f0 fs/super.c:1547
      >  do_new_mount fs/namespace.c:2813 [inline]
      >  do_mount+0x1306/0x1b30 fs/namespace.c:3138
      >  __do_sys_mount fs/namespace.c:3347 [inline]
      >  __se_sys_mount fs/namespace.c:3324 [inline]
      >  __x64_sys_mount+0x18f/0x230 fs/namespace.c:3324
      >  do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295
      >  entry_SYSCALL_64_after_hwframe+0x49/0xb3
      > RIP: 0033:0x45c889
      > Code: ad b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      > RSP: 002b:00007ffc1930ec48 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
      > RAX: ffffffffffffffda RBX: 0000000001324914 RCX: 000000000045c889
      > RDX: 0000000020000140 RSI: 0000000020000040 RDI: 0000000000000000
      > RBP: 000000000076bf00 R08: 0000000000000000 R09: 0000000000000000
      > R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
      > R13: 0000000000000749 R14: 00000000004ca15a R15: 0000000000000013
      
      Looking at the code, now that the internal mount of proc is no
      longer used it is possible to unmount proc.  If proc is unmounted
      the fields of the pid namespace that were used for filesystem
      specific state are not reinitialized.

      This means that proc_self and proc_thread_self can be pointers to
      already freed dentries.

      The reported use after free appears to come from mounting and
      unmounting proc followed by mounting proc again and using error
      injection to cause the new root dentry allocation to fail.  This in
      turn results in proc_kill_sb running with proc_self and
      proc_thread_self still retaining their values from the previous mount
      of proc.  Then calling dput on either proc_self or proc_thread_self
      will result in a double put, which KASAN sees as a use after free.
      
      Solve this by always reinitializing the filesystem state stored
      in the struct pid_namespace, when proc is unmounted.
      
      Reported-by: syzbot+72868dd424eb66c6b95f@syzkaller.appspotmail.com
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Fixes: 69879c01 ("proc: Remove the now unnecessary internal mount of proc")
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • ext4: convert BUG_ON's to WARN_ON's in mballoc.c · 907ea529
      Theodore Ts'o authored
      If the in-core buddy bitmap gets corrupted (or out of sync with the
      block bitmap), issue a WARN_ON and try to recover.  In most cases this
      involves skipping trying to allocate out of a particular block group.
      We can end up declaring the file system corrupted, which is fair,
      since the file system probably should be checked before we proceed any
      further.
      
      Link: https://lore.kernel.org/r/20200414035649.293164-1-tytso@mit.edu
      Google-Bug-Id: 34811296
      Google-Bug-Id: 34639169
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: increase wait time needed before reuse of deleted inode numbers · a17a9d93
      Theodore Ts'o authored
      Current wait times have proven to be too short to protect against inode
      reuses that lead to metadata inconsistencies.
      
      Now that we will retry the inode allocation if we can't find any
      recently deleted inodes, it's a lot safer to increase the recently
      deleted time from 5 seconds to a minute.
      
      Link: https://lore.kernel.org/r/20200414023925.273867-1-tytso@mit.edu
      Google-Bug-Id: 36602237
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: remove set but not used variable 'es' in ext4_jbd2.c · 64881411
      Jason Yan authored
      Fix the following gcc warning:
      
      fs/ext4/ext4_jbd2.c:341:30: warning: variable 'es' set but not used [-Wunused-but-set-variable]
           struct ext4_super_block *es;
                                    ^~
      
      Fixes: 2ea2fc775321 ("ext4: save all error info in save_error_info() and drop ext4_set_errno()")
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Jason Yan <yanaijie@huawei.com>
      Link: https://lore.kernel.org/r/20200402034759.29957-1-yanaijie@huawei.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: remove set but not used variable 'es' · 05ca87c1
      Jason Yan authored
      Fix the following gcc warning:
      
      fs/ext4/super.c:599:27: warning: variable 'es' set but not used [-Wunused-but-set-variable]
        struct ext4_super_block *es;
                                 ^~

      Fixes: 2ea2fc775321 ("ext4: save all error info in save_error_info() and drop ext4_set_errno()")
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Jason Yan <yanaijie@huawei.com>
      Link: https://lore.kernel.org/r/20200402033939.25303-1-yanaijie@huawei.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: do not zeroout extents beyond i_disksize · 801674f3
      Jan Kara authored
      We do not want to create initialized extents beyond end of file because
      for e2fsck it is impossible to distinguish them from a case of corrupted
      file size / extent tree and so it complains like:
      
      Inode 12, i_size is 147456, should be 163840.  Fix? no
      
      Code in ext4_ext_convert_to_initialized() and
      ext4_split_convert_extents() tries to make sure it does not create
      initialized extents beyond inode size; however, it checks against
      inode->i_size, which is wrong. It should instead check against
      EXT4_I(inode)->i_disksize, which is the current inode size on disk.
      That's what e2fsck is going to see in case of a crash before all dirty
      data is written. This bug manifests as a generic/456 test failure (with
      recent enough fstests where fsx got fixed to properly pass
      FALLOC_KEEP_SIZE_FL flags to the kernel) when run with the dioread_lock
      mount option.
      
      CC: stable@vger.kernel.org
      Fixes: 21ca087a ("ext4: Do not zero out uninitialized extents beyond i_size")
      Reviewed-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Link: https://lore.kernel.org/r/20200331105016.8674-1-jack@suse.cz
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: fix return-value types in several function comments · 9033783c
      Josh Triplett authored
      The documentation comments for ext4_read_block_bitmap_nowait and
      ext4_read_inode_bitmap describe them as returning NULL on error, but
      they return an ERR_PTR on error; update the documentation to match.
      
      The documentation comment for ext4_wait_block_bitmap describes it as
      returning 1 on error, but it returns -errno on error; update the
      documentation to match.
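
      Why the distinction matters to callers can be shown with a small
      user-space emulation of the error-pointer convention. Everything here
      is a hypothetical stand-in (the sketch_* helpers and
      read_bitmap_sketch() are not kernel or ext4 functions): an error
      pointer is not NULL, so a caller that only tests for NULL would treat
      the failure as success.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_ERRNO 4095	/* assumption: errno values fit below this */

/* User-space stand-ins for the kernel's error-pointer helpers. */
static inline void *sketch_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static inline long sketch_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

static inline bool sketch_is_err(const void *p)
{
	/* The top SKETCH_MAX_ERRNO addresses encode negative errno values. */
	return (uintptr_t)p >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

/* Hypothetical reader: fails with an encoded -EIO for negative groups. */
void *read_bitmap_sketch(int group)
{
	static unsigned char bitmap[16];

	if (group < 0)
		return sketch_err_ptr(-EIO);
	return bitmap;
}
```

      A caller of such a function has to test the result with the is-error
      helper and decode the errno from the pointer, which is why a comment
      promising NULL on error is misleading.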

      Signed-off-by: Josh Triplett <josh@joshtriplett.org>
      Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
      Link: https://lore.kernel.org/r/60a3f4996f4932c45515aaa6b75ca42f2a78ec9b.1585512514.git.josh@joshtriplett.org
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: use non-movable memory for superblock readahead · d87f6392
      Roman Gushchin authored
      Since commit a8ac900b ("ext4: use non-movable memory for the
      superblock"), buffers for the ext4 superblock have been allocated using
      the sb_bread_unmovable() helper, which allocates buffer heads
      out of non-movable memory blocks. This was necessary to avoid blocking
      page migrations and causing CMA allocation failures.
      
      However commit 85c8f176 ("ext4: preload block group descriptors")
      broke this by introducing pre-reading of the ext4 superblock.
      The problem is that __breadahead() is using __getblk() underneath,
      which allocates buffer heads out of movable memory.
      
      It resulted in page migration failures I've seen on a machine
      with an ext4 partition and a preallocated cma area.
      
      Fix this by introducing sb_breadahead_unmovable() and
      __breadahead_gfp() helpers which use non-movable memory for buffer
      head allocations and use them for the ext4 superblock readahead.

      Reviewed-by: Andreas Dilger <adilger@dilger.ca>
      Fixes: 85c8f176 ("ext4: preload block group descriptors")
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Link: https://lore.kernel.org/r/20200229001411.128010-1-guro@fb.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: use matching invalidatepage in ext4_writepage · c2a559bc
      yangerkun authored
      Running generic/388 in journal data mode may sometimes trigger the warning
      in ext4_invalidatepage. We should use the matching invalidatepage
      in ext4_writepage.

      Signed-off-by: yangerkun <yangerkun@huawei.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20200226041002.13914-1-yangerkun@huawei.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • cifs: improve read performance for page size 64KB & cache=strict & vers=2.1+ · 1f641d94
      Jones Syue authored
      Found a read performance issue when the Linux kernel page size is 64KB.
      If the page size is 64KB and the mount options are cache=strict &
      vers=2.1+, cifs_readpages() is not used. Instead, reads go through
      cifs_readpage() and cifs_read() with a maximum read IO size of 16KB,
      which is much slower than the 1MB read IO size available when SMB 2.1+
      is negotiated. Since modern SMB servers support SMB 2.1+ and Max Read
      Size can reach more than 64KB (for example 1MB ~ 8MB), this patch checks
      max_read instead of maxBuf to determine whether the server supports
      readpages(), which improves read performance for page size 64KB &
      cache=strict & vers=2.1+. For SMB1 it is cleaner to initialize
      server->max_read to server->maxBuf.
      
      The client is a Linux box with Linux kernel 4.2.8,
      page size 64KB (CONFIG_ARM64_64K_PAGES=y),
      a 1.7GHz ARM CPU, and mount.cifs as the SMB client.
      The server is another Linux box with Linux kernel 4.2.8
      that shares a 10GB file '10G.img'
      and uses samba-4.7.12 as the SMB server.
      
      The client mounts a share from the server with different
      cache options, cache=strict and cache=none:
      mount -tcifs //<server_ip>/Public /cache_strict -overs=3.0,cache=strict,username=<xxx>,password=<yyy>
      mount -tcifs //<server_ip>/Public /cache_none -overs=3.0,cache=none,username=<xxx>,password=<yyy>
      
      The client downloads the 10GB file from the server across a 1GbE network:
      dd if=/cache_strict/10G.img of=/dev/null bs=1M count=10240
      dd if=/cache_none/10G.img of=/dev/null bs=1M count=10240
      
      Found that cache=strict (without patch) has lower read throughput and
      a smaller read IO size than cache=none:
      cache=strict (without patch): read throughput 40MB/s, read IO size is 16KB
      cache=strict (with patch): read throughput 113MB/s, read IO size is 1MB
      cache=none: read throughput 109MB/s, read IO size is 1MB
      
      It looks like, if the page size is 64KB,
      cifs_set_ops() uses cifs_addr_ops_smallbuf instead of cifs_addr_ops:
      
      	/* check if server can support readpages */
      	if (cifs_sb_master_tcon(cifs_sb)->ses->server->maxBuf <
      			PAGE_SIZE + MAX_CIFS_HDR_SIZE)
      		inode->i_data.a_ops = &cifs_addr_ops_smallbuf;
      	else
      		inode->i_data.a_ops = &cifs_addr_ops;
      
      maxBuf comes from 2 places, SMB2_negotiate() and CIFSSMBNegotiate()
      (SMB2_MAX_BUFFER_SIZE is 64KB):
      SMB2_negotiate():
      	/* set it to the maximum buffer size value we can send with 1 credit */
      	server->maxBuf = min_t(unsigned int, le32_to_cpu(rsp->MaxTransactSize),
      			       SMB2_MAX_BUFFER_SIZE);
      CIFSSMBNegotiate():
      	server->maxBuf = le32_to_cpu(pSMBr->MaxBufferSize);
      
      Page size 64KB and cache=strict lead read_pages() to use cifs_readpage()
      instead of cifs_readpages(), and then cifs_read() uses a maximum read IO
      size of 16KB, which is much slower than a maximum read IO size of 1MB
      (CIFSMaxBufSize is 16KB by default):
      
      	/* FIXME: set up handlers for larger reads and/or convert to async */
      	rsize = min_t(unsigned int, cifs_sb->rsize, CIFSMaxBufSize);
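
      The effect of checking max_read instead of maxBuf can be illustrated
      with the arithmetic alone. The sketch below is not the cifs code:
      can_use_readpages() is a hypothetical helper that reuses the shape of
      the check quoted above, and the header size is passed in as a
      parameter because its real value is not stated here.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper mirroring the shape of the readpages check:
 * the server limit must cover one page plus the protocol header.
 */
bool can_use_readpages(unsigned int server_limit,
		       unsigned int page_size,
		       unsigned int hdr_size)
{
	return server_limit >= page_size + hdr_size;
}
```

      With a 64KB page, a limit capped at 64KB (as maxBuf is for SMB2) can
      never cover a page plus a non-empty header, so the small-buffer path is
      chosen; a 1MB limit, such as a negotiated max_read, passes the same
      check.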

      Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
      Signed-off-by: Jones Syue <jonessyue@qnap.com>
      Signed-off-by: Steve French <stfrench@microsoft.com>
    • cifs: dump the session id and keys also for SMB2 sessions · f560cda9
      Ronnie Sahlberg authored
      We already dump these keys for SMB3; let's also dump them for SMB2
      sessions so that we can use the session key in wireshark to check and validate
      that the signatures are correct.

      Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
      Signed-off-by: Steve French <stfrench@microsoft.com>
      Reviewed-by: Aurelien Aptel <aaptel@suse.com>
  5. 15 Apr, 2020 3 commits
    • io_uring: don't count rqs failed after current one · 31af27c7
      Pavel Begunkov authored
      When checking for draining with __req_need_defer(), it tries to match
      how many requests were sent before the current one with the number
      already completed. Dropped SQEs are included in req->sequence, and they
      won't ever appear in CQ. To compensate for that, __req_need_defer()
      subtracts ctx->cached_sq_dropped.
      However, what it should really use is the number of SQEs dropped
      __before__ the current one. In other words, any submitted request
      shouldn't affect dequeueing from the drain queue of previously submitted
      ones.

      Instead of saving proper ctx->cached_sq_dropped in each request,
      subtract it from req->sequence at initialisation, so it includes only
      the number of properly submitted requests.

      note: it also changes behaviour of timeouts, but
      1. it already diverges from the description because of using SQ
      2. the description is ambiguous regarding dropped SQEs
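
      The accounting change can be modelled with a few counters. The sketch
      below is a simplified user-space model, not the io_uring code: struct
      ring_sketch and the three helpers are hypothetical, CQ overflow is left
      out, and the sequence is taken as a plain snapshot of the counters.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model of the ring counters (assumption: no CQ overflow). */
struct ring_sketch {
	unsigned int sq_head;		/* SQEs consumed so far, dropped included */
	unsigned int sq_dropped;	/* SQEs dropped so far */
	unsigned int cq_tail;		/* completions posted so far */
};

/* New scheme: snapshot only the properly submitted requests at init. */
unsigned int req_sequence(const struct ring_sketch *r)
{
	return r->sq_head - r->sq_dropped;
}

/* New scheme: defer until that many completions have been posted. */
bool req_need_defer(const struct ring_sketch *r, unsigned int seq)
{
	return seq != r->cq_tail;
}

/* Old scheme: raw sequence compared against the *current* drop count. */
bool old_need_defer(const struct ring_sketch *r, unsigned int raw_seq)
{
	return raw_seq != r->cq_tail + r->sq_dropped;
}
```

      In this model, take a drain request initialised after two good
      submissions and followed by one dropped SQE. Once the two earlier
      requests complete, the old check still defers because the later drop
      is counted, while the new check releases the request.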

      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: kill already cached timeout.seq_offset · b55ce732
      Pavel Begunkov authored
      req->timeout.count and req->io->timeout.seq_offset store the same value,
      which is sqe->off. Kill the second one.

      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: fix cached_sq_head in io_timeout() · 22cad158
      Pavel Begunkov authored
      io_timeout() can be executed asynchronously by a worker and without
      holding ctx->uring_lock:

      1. using ctx->cached_sq_head there is racy
      2. it should count events from the moment of the timeout's submission,
      not its execution

      Use req->sequence.

      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  6. 13 Apr, 2020 13 commits
    • io_uring: only post events in io_poll_remove_all() if we completed some · 8e2e1faf
      Jens Axboe authored
      syzbot reports this crash:
      
      BUG: unable to handle page fault for address: ffffffffffffffe8
      PGD f96e17067 P4D f96e17067 PUD f96e19067 PMD 0
      Oops: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
      CPU: 55 PID: 211750 Comm: trinity-c127 Tainted: G    B        L    5.7.0-rc1-next-20200413 #4
      Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 04/12/2017
      RIP: 0010:__wake_up_common+0x98/0x290
      el/sched/wait.c:87
      Code: 40 4d 8d 78 e8 49 8d 7f 18 49 39 fd 0f 84 80 00 00 00 e8 6b bd 2b 00 49 8b 5f 18 45 31 e4 48 83 eb 18 4c 89 ff e8 08 bc 2b 00 <45> 8b 37 41 f6 c6 04 75 71 49 8d 7f 10 e8 46 bd 2b 00 49 8b 47 10
      RSP: 0018:ffffc9000adbfaf0 EFLAGS: 00010046
      RAX: 0000000000000000 RBX: ffffffffffffffe8 RCX: ffffffffaa9636b8
      RDX: 0000000000000003 RSI: dffffc0000000000 RDI: ffffffffffffffe8
      RBP: ffffc9000adbfb40 R08: fffffbfff582c5fd R09: fffffbfff582c5fd
      R10: ffffffffac162fe3 R11: fffffbfff582c5fc R12: 0000000000000000
      R13: ffff888ef82b0960 R14: ffffc9000adbfb80 R15: ffffffffffffffe8
      FS:  00007fdcba4c4740(0000) GS:ffff889033780000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: ffffffffffffffe8 CR3: 0000000f776a0004 CR4: 00000000001606e0
      Call Trace:
       __wake_up_common_lock+0xea/0x150
      ommon_lock at kernel/sched/wait.c:124
       ? __wake_up_common+0x290/0x290
       ? lockdep_hardirqs_on+0x16/0x2c0
       __wake_up+0x13/0x20
       io_cqring_ev_posted+0x75/0xe0
      v_posted at fs/io_uring.c:1160
       io_ring_ctx_wait_and_kill+0x1c0/0x2f0
      l at fs/io_uring.c:7305
       io_uring_create+0xa8d/0x13b0
       ? io_req_defer_prep+0x990/0x990
       ? __kasan_check_write+0x14/0x20
       io_uring_setup+0xb8/0x130
       ? io_uring_create+0x13b0/0x13b0
       ? check_flags.part.28+0x220/0x220
       ? lockdep_hardirqs_on+0x16/0x2c0
       __x64_sys_io_uring_setup+0x31/0x40
       do_syscall_64+0xcc/0xaf0
       ? syscall_return_slowpath+0x580/0x580
       ? lockdep_hardirqs_off+0x1f/0x140
       ? entry_SYSCALL_64_after_hwframe+0x3e/0xb3
       ? trace_hardirqs_off_caller+0x3a/0x150
       ? trace_hardirqs_off_thunk+0x1a/0x1c
       entry_SYSCALL_64_after_hwframe+0x49/0xb3
      RIP: 0033:0x7fdcb9dd76ed
      Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 6b 57 2c 00 f7 d8 64 89 01 48
      RSP: 002b:00007ffe7fd4e4f8 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9
      RAX: ffffffffffffffda RBX: 00000000000001a9 RCX: 00007fdcb9dd76ed
      RDX: fffffffffffffffc RSI: 0000000000000000 RDI: 0000000000005d54
      RBP: 00000000000001a9 R08: 0000000e31d3caa7 R09: 0082400004004000
      R10: ffffffffffffffff R11: 0000000000000246 R12: 0000000000000002
      R13: 00007fdcb842e058 R14: 00007fdcba4c46c0 R15: 00007fdcb842e000
      Modules linked in: bridge stp llc nfnetlink cn brd vfat fat ext4 crc16 mbcache jbd2 loop kvm_intel kvm irqbypass intel_cstate intel_uncore dax_pmem intel_rapl_perf dax_pmem_core ip_tables x_tables xfs sd_mod tg3 firmware_class libphy hpsa scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod [last unloaded: binfmt_misc]
      CR2: ffffffffffffffe8
      ---[ end trace f9502383d57e0e22 ]---
      RIP: 0010:__wake_up_common+0x98/0x290
      Code: 40 4d 8d 78 e8 49 8d 7f 18 49 39 fd 0f 84 80 00 00 00 e8 6b bd 2b 00 49 8b 5f 18 45 31 e4 48 83 eb 18 4c 89 ff e8 08 bc 2b 00 <45> 8b 37 41 f6 c6 04 75 71 49 8d 7f 10 e8 46 bd 2b 00 49 8b 47 10
      RSP: 0018:ffffc9000adbfaf0 EFLAGS: 00010046
      RAX: 0000000000000000 RBX: ffffffffffffffe8 RCX: ffffffffaa9636b8
      RDX: 0000000000000003 RSI: dffffc0000000000 RDI: ffffffffffffffe8
      RBP: ffffc9000adbfb40 R08: fffffbfff582c5fd R09: fffffbfff582c5fd
      R10: ffffffffac162fe3 R11: fffffbfff582c5fc R12: 0000000000000000
      R13: ffff888ef82b0960 R14: ffffc9000adbfb80 R15: ffffffffffffffe8
      FS:  00007fdcba4c4740(0000) GS:ffff889033780000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: ffffffffffffffe8 CR3: 0000000f776a0004 CR4: 00000000001606e0
      Kernel panic - not syncing: Fatal exception
      Kernel Offset: 0x29800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
      ---[ end Kernel panic - not syncing: Fatal exception ]---
      
      which is due to error injection (or allocation failure) preventing the
      rings from being set up. On shutdown, we attempt to remove any pending
      requests, and for poll requests, we call io_cqring_ev_posted() when we've
      killed poll requests. However, since the rings aren't set up, we won't
      find any poll requests. Make the calling of io_cqring_ev_posted()
      dependent on actually having completed requests. This fixes this setup
      corner case, and removes spurious calls if we remove poll requests and
      don't find any.
      Reported-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      8e2e1faf
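The idea behind the io_uring fix above (notify waiters only when at least one poll request was actually completed) can be modelled in a few lines of user-space C. This is a minimal sketch, not the kernel code: `struct ring_ctx`, `ev_posted()` and `poll_remove_all()` are hypothetical stand-ins for the context, io_cqring_ev_posted() and the poll teardown path.

```c
/* Hypothetical stand-in for the ring context; not the kernel's structure. */
struct ring_ctx {
    int pending_polls;  /* poll requests still armed */
    int wakeups;        /* number of times waiters were notified */
};

/* Models io_cqring_ev_posted(): wakes anyone waiting on completions. */
static void ev_posted(struct ring_ctx *ctx)
{
    ctx->wakeups++;
}

/* Kill every armed poll request and return how many were completed. */
static int poll_remove_all(struct ring_ctx *ctx)
{
    int posted = 0;

    while (ctx->pending_polls > 0) {
        ctx->pending_polls--;
        posted++;
    }

    /* Only notify when something was actually completed. */
    if (posted)
        ev_posted(ctx);

    return posted;
}
```

With no armed requests, as in the failed-setup case described above, the notification path is never entered.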
    • Trond Myklebust's avatar
      NFS: Fix an ABBA spinlock issue in pnfs_update_layout() · fbf4bcc9
      Trond Myklebust authored
      We need to drop the inode spinlock while calling nfs4_select_rw_stateid(),
      since nfs4_copy_delegation_stateid() could take the delegation lock.
      Note that it is safe to do this, since all other calls to
      pnfs_update_layout() for that inode will find themselves blocked by
      the lock we hold on NFS_LAYOUT_FIRST_LAYOUTGET.
      
      Fixes: fc51b1cf ("NFS: Beware when dereferencing the delegation cred")
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      fbf4bcc9
    • Jeff Layton's avatar
      ceph: fix potential bad pointer deref in async dirops cb's · 2a575f13
      Jeff Layton authored
      The new async dirops callback routines can pass ERR_PTR values to
      ceph_mdsc_free_path, which could cause an oops. Make ceph_mdsc_free_path
      ignore ERR_PTR values. Also, ensure that the pr_warn messages look sane
      even if ceph_mdsc_build_path fails.
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Jeff Layton <jlayton@kernel.org>
      Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      2a575f13
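To illustrate why a free routine has to tolerate ERR_PTR values, here is a user-space sketch of the error-pointer convention. It assumes the usual encoding in which the top 4095 addresses represent negative errno values. `err_ptr()`, `is_err_or_null()` and `free_path()` are hypothetical names for this example, not the ceph functions themselves.

```c
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value as a pointer, as ERR_PTR() does. */
static inline void *err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

/* True for NULL and for any pointer in the reserved error range. */
static inline int is_err_or_null(const void *p)
{
    return p == NULL || (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/* Free a path buffer; returns 1 if memory was released, 0 if skipped. */
static int free_path(char *path)
{
    if (is_err_or_null(path))
        return 0;   /* nothing was allocated, so nothing to free */
    free(path);
    return 1;
}
```

Callers can then pass the result of a failed path build straight to the free routine without checking it first.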
    • Jens Axboe's avatar
      io_uring: io_async_task_func() should check and honor cancelation · 2bae047e
      Jens Axboe authored
      If the request has been marked as canceled, don't try and issue it.
      Instead just fill a canceled event and finish the request.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      2bae047e
    • Jens Axboe's avatar
      io_uring: check for need to re-wait in polled async handling · 74ce6ce4
      Jens Axboe authored
      We added this for just the regular poll requests in commit a6ba632d
      ("io_uring: retry poll if we got woken with non-matching mask"). We
      should do the same for the poll handler used for pollable async requests.
      Move the re-wait check and arm into a helper, and call it from
      io_async_task_func() as well.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      74ce6ce4
    • Darrick J. Wong's avatar
      xfs: fix partially uninitialized structure in xfs_reflink_remap_extent · c142932c
      Darrick J. Wong authored
      In the reflink extent remap function, it turns out that uirec (the block
      mapping corresponding only to the part of the passed-in mapping that got
      unmapped) was not fully initialized.  Specifically, br_state was not
      being copied from the passed-in struct to the uirec.  This could lead to
      unpredictable results such as the reflinked mapping being marked
      unwritten in the destination file.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      c142932c
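The class of bug described above (a derived record that copies some fields but leaves one indeterminate) can be avoided by starting from a full copy of the source record. The sketch below shows one way to do that. `struct mapping` and `unmapped_part()` are hypothetical and much simpler than the XFS structures, and the upstream patch may well copy the single missing field instead.

```c
/* Hypothetical, simplified block mapping; not struct xfs_bmbt_irec. */
struct mapping {
    unsigned long startoff;
    unsigned long startblock;
    unsigned long blockcount;
    int state;              /* e.g. written vs. unwritten */
};

/* Describe the first 'len' blocks of 'irec' as a separate record. */
static struct mapping unmapped_part(const struct mapping *irec,
                                    unsigned long len)
{
    /* Start from a full copy so that no field, including state,
     * is left indeterminate. */
    struct mapping uirec = *irec;

    uirec.blockcount = len;
    return uirec;
}
```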
    • Brian Foster's avatar
      xfs: acquire superblock freeze protection on eofblocks scans · 4b674b9a
      Brian Foster authored
      The filesystem freeze sequence in XFS waits on any background
      eofblocks or cowblocks scans to complete before the filesystem is
      quiesced. At this point, the freezer has already stopped the
      transaction subsystem, however, which means a truncate or cowblock
      cancellation in progress is likely blocked in transaction
      allocation. This results in a deadlock between freeze and the
      associated scanner.
      
      Fix this problem by holding superblock write protection across calls
      into the block reapers. Since protection for background scans is
      acquired from the workqueue task context, trylock to avoid a similar
      deadlock between freeze and blocking on the write lock.
      
      Fixes: d6b636eb ("xfs: halt auto-reclamation activities while rebuilding rmap")
      Reported-by: Paul Furtado <paulfurtado91@gmail.com>
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Allison Collins <allison.henderson@oracle.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      4b674b9a
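The trylock pattern described in the fix above can be sketched with a single-threaded toy model: a background scan takes write protection only if no freeze is in progress, and otherwise skips its work rather than blocking. All names here (`struct superblock`, `sb_try_start_write()`, `background_scan()`) are assumptions for illustration. The real freeze protection is a per-superblock lock, not a boolean.

```c
#include <stdbool.h>

/* Toy model of freeze protection; not the kernel's struct super_block. */
struct superblock {
    bool freezing;  /* a freeze has started and writers must stay out */
    int writers;    /* tasks currently holding write protection */
};

/* Non-blocking acquire: fail immediately if a freeze is under way. */
static bool sb_try_start_write(struct superblock *sb)
{
    if (sb->freezing)
        return false;
    sb->writers++;
    return true;
}

static void sb_end_write(struct superblock *sb)
{
    sb->writers--;
}

/* Background reaper: skip the scan instead of blocking against freeze. */
static bool background_scan(struct superblock *sb, int *reaped)
{
    if (!sb_try_start_write(sb))
        return false;   /* retried on a later run */
    (*reaped)++;        /* stands in for trimming eofblocks/cowblocks */
    sb_end_write(sb);
    return true;
}
```

Because the scan never waits for the lock, the freezer cannot end up waiting on a worker that is itself waiting on the freezer.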
    • Vasily Averin's avatar
      nfsd: memory corruption in nfsd4_lock() · e1e8399e
      Vasily Averin authored
      The new struct nfsd4_blocked_lock allocated in find_or_allocate_block()
      does not initialize nbl_list and nbl_lru.
      If conflock allocation fails, the rollback path can call list_del_init(),
      which accesses the uninitialized fields and corrupts memory.
      
      v2: just initialize nbl_list and nbl_lru right after nbl allocation.
      
      Fixes: 76d348fa ("nfsd: have nfsd4_lock use blocking locks for v4.1+ lock")
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      e1e8399e
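Why initializing the list heads right after allocation makes the rollback safe can be seen with a small user-space re-implementation of the circular list primitives. The helpers below mirror the semantics of INIT_LIST_HEAD(), list_del_init() and list_empty(). `struct blocked_lock` and `alloc_blocked_lock()` are simplified, hypothetical stand-ins for the nfsd code.

```c
#include <stdlib.h>

struct list_head {
    struct list_head *next, *prev;
};

/* An initialized but unlinked node points at itself. */
static void init_list_head(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

/* Unlink 'e' and reinitialize it; dereferences e->prev and e->next,
 * so both must be valid pointers. */
static void list_del_init(struct list_head *e)
{
    e->prev->next = e->next;
    e->next->prev = e->prev;
    init_list_head(e);
}

static int list_empty(const struct list_head *h)
{
    return h->next == h;
}

/* Simplified stand-in for struct nfsd4_blocked_lock. */
struct blocked_lock {
    struct list_head nbl_list;
    struct list_head nbl_lru;
};

static struct blocked_lock *alloc_blocked_lock(void)
{
    struct blocked_lock *nbl = malloc(sizeof(*nbl));

    if (!nbl)
        return NULL;
    /* Initialize right after allocation so any later error path can
     * call list_del_init() safely. */
    init_list_head(&nbl->nbl_list);
    init_list_head(&nbl->nbl_lru);
    return nbl;
}
```

Without the two init_list_head() calls, list_del_init() would write through whatever pointer values malloc() happened to leave in the structure.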
    • David Howells's avatar
      afs: Fix afs_d_validate() to set the right directory version · 40fc8102
      David Howells authored
      If a dentry's version is somewhere between invalid_before and the current
      directory version, we should be setting it forward to the current version,
      not backwards to the invalid_before version.  Note that we're only doing
      this at all because dentry::d_fsdata isn't large enough on a 32-bit system.
      
      Fix this by using a separate variable for invalid_before so that we don't
      accidentally clobber the current dir version.
      
      Fixes: a4ff7401 ("afs: Keep track of invalid-before version for dentry coherency")
      Signed-off-by: David Howells <dhowells@redhat.com>
      40fc8102
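The version check described above can be sketched as follows. The function name and signature are hypothetical. The point is only that invalid_before is kept in its own variable, so that a still-valid dentry is moved forward to the current directory version rather than backwards.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether a cached dentry is still valid.  On success the dentry
 * version is advanced to the current directory version.
 */
static bool dentry_version_valid(uint64_t *de_version,
                                 uint64_t dir_version,
                                 uint64_t invalid_before)
{
    if (*de_version == dir_version)
        return true;                /* already up to date */

    if (*de_version >= invalid_before) {
        /* Set forward to the current version, not to invalid_before. */
        *de_version = dir_version;
        return true;
    }

    return false;                   /* too old: needs a fresh lookup */
}
```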
    • David Howells's avatar
      afs: Fix race between post-modification dir edit and readdir/d_revalidate · 2105c282
      David Howells authored
      AFS directories are retained locally as a structured file, with lookup
      being effected by a local search of the file contents.  When a modification
      (such as mkdir) happens, the dir file content is modified locally rather
      than redownloading the directory.
      
      The directory contents are accessed in a number of ways, with a number of
      different locking schemes:
      
       (1) Download of contents - dvnode->validate_lock/write in afs_read_dir().
      
       (2) Lookup and readdir - dvnode->validate_lock/read in afs_dir_iterate(),
           downgrading from (1) if necessary.
      
       (3) d_revalidate of child dentry - dvnode->validate_lock/read in
           afs_do_lookup_one() downgrading from (1) if necessary.
      
       (4) Edit of dir after modification - page locks on individual dir pages.
      
      Unfortunately, because (4) uses a different locking scheme to (1) - (3),
      nothing protects against the page being scanned whilst the edit is
      underway.  Even download is not safe as it doesn't lock the pages - relying
      instead on the validate_lock to serialise as a whole (the theory being that
      directory contents are treated as a block and always downloaded as a
      block).
      
      Fix this by write-locking dvnode->validate_lock around the edits.  Care
      must be taken in the rename case as there may be two different dirs - but
      they need not be locked at the same time.  In any case, once the lock is
      taken, the directory version must be rechecked, and the edit skipped if a
      later version has been downloaded by revalidation (there can't have been
      any local changes because the VFS holds the inode lock, but there can have
      been remote changes).
      
      Fixes: 63a4681f ("afs: Locally edit directory data for mkdir/create/unlink/...")
      Signed-off-by: David Howells <dhowells@redhat.com>
      2105c282
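The "take the lock, then recheck the version" step from the fix above is sketched here as a single-threaded toy. `struct dir_node`, its boolean "lock" and `dir_edit_add()` are hypothetical. A real implementation would use a read/write semaphore rather than a flag. `expected_version` stands for the data version the local copy should have if nothing else has changed the directory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy directory with a flag standing in for validate_lock. */
struct dir_node {
    bool write_locked;
    uint64_t data_version;  /* version of the locally cached contents */
    int nr_entries;
};

static void dir_write_lock(struct dir_node *d)
{
    assert(!d->write_locked);
    d->write_locked = true;
}

static void dir_write_unlock(struct dir_node *d)
{
    assert(d->write_locked);
    d->write_locked = false;
}

/* Apply a local edit only if the cached copy is the expected version. */
static bool dir_edit_add(struct dir_node *d, uint64_t expected_version)
{
    bool edited = false;

    dir_write_lock(d);
    if (d->data_version == expected_version) {
        d->nr_entries++;    /* stands in for editing the dir pages */
        edited = true;
    }
    /* Otherwise a newer copy was downloaded meanwhile: skip the edit. */
    dir_write_unlock(d);
    return edited;
}
```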
    • David Howells's avatar
      afs: Fix length of dump of bad YFSFetchStatus record · 3efe55b0
      David Howells authored
      Fix the length of the dump of a bad YFSFetchStatus record.  The function
      was copied from the AFS version, but the YFS variant contains bigger fields
      and extra information, so expand the dump to match.
      Signed-off-by: David Howells <dhowells@redhat.com>
      3efe55b0
    • David Howells's avatar
      afs: Fix rename operation status delivery · b98f0ec9
      David Howells authored
      The afs_deliver_fs_rename() and yfs_deliver_fs_rename() functions both only
      decode the second file status returned unless the parent directories are
      different - unfortunately, this means that the xdr pointer isn't advanced
      and the volsync record will be read incorrectly in such an instance.
      
      Fix this by always decoding the second status into the second
      status/callback block which wasn't being used if the dirs were the same.
      
      The afs_update_dentry_version() calls that update the directory data
      version numbers on the dentries can then unconditionally use the second
      status record as this will always reflect the state of the destination dir
      (the two records will be identical if the destination dir is the same as
      the source dir).
      
      Fixes: 260a9803 ("[AFS]: Add "directory write" support.")
      Fixes: 30062bd1 ("afs: Implement YFS support in the fs client")
      Signed-off-by: David Howells <dhowells@redhat.com>
      b98f0ec9
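The decoding problem above comes down to a cursor that must advance past every record in the reply, whether or not its contents are needed. A minimal sketch, assuming a made-up record size and layout (`STATUS_WORDS`, `struct status` and both functions are hypothetical, not the AFS wire format):

```c
#include <stdint.h>
#include <string.h>

#define STATUS_WORDS 3  /* hypothetical record size in 32-bit words */

struct status {
    uint32_t w[STATUS_WORDS];
};

/* Decode one status record and return the advanced cursor. */
static const uint32_t *decode_status(const uint32_t *bp, struct status *out)
{
    memcpy(out->w, bp, sizeof(out->w));
    return bp + STATUS_WORDS;
}

/*
 * Decode a rename reply: two status records followed by a trailing field.
 * Both records are always decoded, so the cursor ends up in the right
 * place even when source and destination directory are the same.
 */
static uint32_t decode_rename_reply(const uint32_t *bp,
                                    struct status *orig_dir,
                                    struct status *new_dir)
{
    bp = decode_status(bp, orig_dir);
    bp = decode_status(bp, new_dir);
    return *bp; /* first word of the record that follows */
}
```

Skipping the second decode_status() call would leave the cursor three words short, so the trailing field would be read from the middle of the second record.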
    • David Howells's avatar
      afs: Fix decoding of inline abort codes from version 1 status records · 3e0d9892
      David Howells authored
      If we're decoding an AFSFetchStatus record and we see that the version is 1
      and the abort code is set and we're expecting inline errors, then we store
      the abort code and ignore the remaining status record (which is correct),
      but we don't set the flag to say we got a valid abort code.
      
      This can affect operation of YFS.RemoveFile2 when removing a file and the
      operation of {,Y}FS.InlineBulkStatus when prospectively constructing or
      updating a set of inodes during a lookup.
      
      Fix this to indicate the reception of a valid abort code.
      
      Fixes: a38a7558 ("afs: Fix unlink to handle YFS.RemoveFile2 better")
      Signed-off-by: David Howells <dhowells@redhat.com>
      3e0d9892
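The missing flag described above can be illustrated with a reduced decoder. The structure and field names below are assumptions for this sketch and not necessarily those used in fs/afs. The behaviour shown is that an inline abort code is both stored and marked as valid.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result block for one decoded status record. */
struct status_result {
    uint32_t abort_code;
    bool have_error;    /* abort_code holds a valid inline error */
    bool have_status;   /* the rest of the record was decoded */
};

static void decode_fetch_status(uint32_t version, uint32_t abort_code,
                                bool inline_error,
                                struct status_result *res)
{
    res->abort_code = 0;
    res->have_error = false;
    res->have_status = false;

    if (version == 1 && abort_code != 0 && inline_error) {
        /* Store the abort code and say so; the remaining fields of
         * the record are ignored. */
        res->abort_code = abort_code;
        res->have_error = true;
        return;
    }

    res->have_status = true;    /* stands in for decoding the fields */
}
```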