1. 24 Jul, 2020 1 commit
  2. 23 Jul, 2020 2 commits
  3. 22 Jul, 2020 1 commit
  4. 21 Jul, 2020 8 commits
    • btrfs: fix mount failure caused by race with umount · 48cfa61b
      Boris Burkov authored
      It is possible to cause a btrfs mount to fail by racing it with a slow
      umount. The crux of the sequence is generic_shutdown_super not yet
      calling sop->put_super before btrfs_mount_root calls btrfs_open_devices.
      If that occurs, btrfs_open_devices will decide the opened counter is
      non-zero, increment it, and skip resetting fs_devices->total_rw_bytes to
      0. From here, mount will call sget which will result in grab_super
      trying to take the super block umount semaphore. That semaphore will be
      held by the slow umount, so mount will block. Before up-ing the
      semaphore, umount will delete the super block, resulting in mount's sget
      reliably allocating a new one, which causes the mount path to dutifully
      fill it out, and increment total_rw_bytes a second time, which causes
      the mount to fail, as we see double the expected bytes.
      
      Here is the sequence laid out in greater detail:
      
      CPU0                                                    CPU1
      down_write sb->s_umount
      btrfs_kill_super
        kill_anon_super(sb)
          generic_shutdown_super(sb);
            shrink_dcache_for_umount(sb);
            sync_filesystem(sb);
            evict_inodes(sb); // SLOW
      
                                                    btrfs_mount_root
                                                      btrfs_scan_one_device
                                                      fs_devices = device->fs_devices
                                                      fs_info->fs_devices = fs_devices
                                                      // fs_devices->opened makes this a no-op
                                                      btrfs_open_devices(fs_devices, mode, fs_type)
                                                      s = sget(fs_type, test, set, flags, fs_info);
                                                        find sb in s_instances
                                                        grab_super(sb);
                                                          down_write(&s->s_umount); // blocks
      
            sop->put_super(sb)
              // sb->fs_devices->opened == 2; no-op
            spin_lock(&sb_lock);
            hlist_del_init(&sb->s_instances);
            spin_unlock(&sb_lock);
            up_write(&sb->s_umount);
                                                          return 0;
                                                        retry lookup
                                                        don't find sb in s_instances (deleted by CPU0)
                                                        s = alloc_super
                                                        return s;
                                                      btrfs_fill_super(s, fs_devices, data)
                                                        open_ctree // fs_devices total_rw_bytes improperly set!
                                                          btrfs_read_chunk_tree
                                                            read_one_dev // increment total_rw_bytes again!!
                                                            super_total_bytes < fs_devices->total_rw_bytes // ERROR!!!
      
      To fix this, we clear total_rw_bytes from within btrfs_read_chunk_tree
      before the calls to read_one_dev, while holding the sb umount semaphore
      and the uuid mutex.
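
The double count and the effect of the fix can be illustrated with a small user-space model. This is only a sketch under simplifying assumptions: `struct fs_devices_model` and the `model_*` functions are hypothetical stand-ins rather than the kernel code, and locking is not modelled at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the per-filesystem device state. */
struct fs_devices_model {
	int opened;              /* number of active opens */
	uint64_t total_rw_bytes; /* sum of the sizes of the devices */
};

/* As described above: only the first opener resets the total. */
static void model_open_devices(struct fs_devices_model *fsd)
{
	assert(fsd != NULL);
	if (fsd->opened == 0)
		fsd->total_rw_bytes = 0;
	fsd->opened++;
}

/* Dropping a reference leaves the total alone unless it was the last one. */
static void model_close_devices(struct fs_devices_model *fsd)
{
	assert(fsd != NULL && fsd->opened > 0);
	if (--fsd->opened == 0)
		fsd->total_rw_bytes = 0;
}

/* Adds every device size to the total; the fix is the reset up front. */
static void model_read_chunk_tree(struct fs_devices_model *fsd,
				  const uint64_t *dev_bytes, size_t ndevs,
				  bool clear_first)
{
	assert(fsd != NULL);
	if (clear_first)
		fsd->total_rw_bytes = 0;
	for (size_t i = 0; i < ndevs; i++)
		fsd->total_rw_bytes += dev_bytes[i];
}

/* Replays the interleaving from the diagram and returns the final total. */
static uint64_t racy_remount_total(const uint64_t *dev_bytes, size_t ndevs,
				   bool with_fix)
{
	struct fs_devices_model fsd = { 0, 0 };

	/* First mount: opened 0 -> 1, sizes counted once. */
	model_open_devices(&fsd);
	model_read_chunk_tree(&fsd, dev_bytes, ndevs, with_fix);

	/* Second mount opens before the slow umount reaches put_super. */
	model_open_devices(&fsd);  /* opened 1 -> 2, no reset */
	model_close_devices(&fsd); /* umount's put_super: opened 2 -> 1 */

	/* Second mount fills its newly allocated super block. */
	model_read_chunk_tree(&fsd, dev_bytes, ndevs, with_fix);
	return fsd.total_rw_bytes;
}
```

With `with_fix` set to false the model ends up with twice the real device size, which is the condition the mount path rejects; with the reset in place the total matches the devices again.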
      
      To reproduce, it is sufficient to dirty a decent number of inodes, then
      quickly umount and mount.
      
        for i in $(seq 0 500)
        do
          dd if=/dev/zero of="/mnt/foo/$i" bs=1M count=1
        done
        umount /mnt/foo&
        mount /mnt/foo
      
      does the trick for me.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix page leaks after failure to lock page for delalloc · 5909ca11
      Robbie Ko authored
      When locking pages for delalloc, we check that each page is still dirty
      and that its mapping still matches. If either check fails, we need to
      return -EAGAIN and release all pages. Only the current page was put,
      though, so iterate over the remaining pages and put them too.
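
The leak can be reproduced in a small reference-counting model. This is a sketch, not the btrfs code: `struct page_model`, `model_put_page()`, `model_lock_pages()` and `model_leaked_refs()` are hypothetical names, and -1 stands in for -EAGAIN.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page: a reference count and two checks. */
struct page_model {
	int refcount;
	bool dirty;
	bool mapping_matches;
};

static void model_put_page(struct page_model *page)
{
	assert(page != NULL && page->refcount > 0);
	page->refcount--;
}

/*
 * Every page in pages[] arrives holding one reference taken by the lookup.
 * In this model a page that passes the checks has its lookup reference
 * dropped right away. Returns 0 on success or -1 after a failed check.
 */
static int model_lock_pages(struct page_model **pages, size_t nr,
			    bool put_remaining)
{
	for (size_t i = 0; i < nr; i++) {
		if (!pages[i]->dirty || !pages[i]->mapping_matches) {
			/* The old behaviour stopped after this single put. */
			model_put_page(pages[i]);
			if (put_remaining) {
				/* The fix: also put the pages not yet visited. */
				for (i++; i < nr; i++)
					model_put_page(pages[i]);
			}
			return -1;
		}
		model_put_page(pages[i]);
	}
	return 0;
}

/* Builds four looked-up pages where the second one is no longer dirty. */
static int model_leaked_refs(bool put_remaining)
{
	struct page_model p[4] = {
		{ 1, true, true }, { 1, false, true },
		{ 1, true, true }, { 1, true, true },
	};
	struct page_model *pages[4] = { &p[0], &p[1], &p[2], &p[3] };
	int leaked = 0;

	if (model_lock_pages(pages, 4, put_remaining) == 0)
		return -1; /* not expected in this scenario */
	for (size_t i = 0; i < 4; i++)
		leaked += p[i].refcount;
	return leaked;
}
```

Without the inner loop the two pages after the failing one keep their lookup reference forever; with it, every reference taken by the lookup is dropped before returning.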
      
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Robbie Ko <robbieko@synology.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: qgroup: fix data leak caused by race between writeback and truncate · fa91e4aa
      Qu Wenruo authored
      [BUG]
      When running tests like generic/013 on a test device with btrfs quota
      enabled, it can lead to a leak of qgroup reserved data space, detected
      at unmount time:
      
        BTRFS warning (device dm-3): qgroup 0/5 has unreleased space, type 0 rsv 4096
        ------------[ cut here ]------------
        WARNING: CPU: 11 PID: 16386 at fs/btrfs/disk-io.c:4142 close_ctree+0x1dc/0x323 [btrfs]
        RIP: 0010:close_ctree+0x1dc/0x323 [btrfs]
        Call Trace:
         btrfs_put_super+0x15/0x17 [btrfs]
         generic_shutdown_super+0x72/0x110
         kill_anon_super+0x18/0x30
         btrfs_kill_super+0x17/0x30 [btrfs]
         deactivate_locked_super+0x3b/0xa0
         deactivate_super+0x40/0x50
         cleanup_mnt+0x135/0x190
         __cleanup_mnt+0x12/0x20
         task_work_run+0x64/0xb0
         __prepare_exit_to_usermode+0x1bc/0x1c0
         __syscall_return_slowpath+0x47/0x230
         do_syscall_64+0x64/0xb0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        ---[ end trace caf08beafeca2392 ]---
        BTRFS error (device dm-3): qgroup reserved space leaked
      
      [CAUSE]
      In the failing case, the offending operations are:
      2/6: writev f2X[269 1 0 0 0 0] [1006997,67,288] 0
      2/7: truncate f2X[269 1 0 0 48 1026293] 18388 0
      
      The following sequence of events could happen after the writev():
      	CPU1 (writeback)		|		CPU2 (truncate)
      -----------------------------------------------------------------
      btrfs_writepages()			|
      |- extent_write_cache_pages()		|
         |- Got page for 1003520		|
         |  1003520 is Dirty, no writeback	|
         |  So (!clear_page_dirty_for_io())   |
         |  gets called for it		|
         |- Now page 1003520 is Clean.	|
         |					| btrfs_setattr()
         |					| |- btrfs_setsize()
         |					|    |- truncate_setsize()
         |					|       New i_size is 18388
         |- __extent_writepage()		|
         |  |- page_offset() > i_size		|
            |- btrfs_invalidatepage()		|
      	 |- Page is clean, so no qgroup |
      	    callback executed
      
      This means, the qgroup reserved data space is not properly released in
      btrfs_invalidatepage() as the page is Clean.
      
      [FIX]
      Instead of checking the dirty bit of a page, call
      btrfs_qgroup_free_data() unconditionally in btrfs_invalidatepage().
      
      Since qgroup reservations are bound entirely to the QGROUP_RESERVED bit
      of the io_tree and not to the page status, this cannot cause a double
      free.
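
Why the unconditional call is safe can be seen in a toy model where the reservation is tracked by a per-range flag. This is a sketch only: `struct qgroup_model` and the `model_*` functions are hypothetical, and a plain boolean array stands in for the QGROUP_RESERVED bit of the io_tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u
#define MODEL_NR_PAGES 8u

/* Hypothetical stand-in: one "reserved" flag per page-sized range. */
struct qgroup_model {
	bool reserved[MODEL_NR_PAGES];
	uint64_t rsv_bytes; /* total reserved data space */
};

static void model_reserve(struct qgroup_model *q, size_t idx)
{
	assert(q != NULL && idx < MODEL_NR_PAGES);
	if (!q->reserved[idx]) {
		q->reserved[idx] = true;
		q->rsv_bytes += MODEL_PAGE_SIZE;
	}
}

/* Frees only what the flag says is reserved, so repeated calls are harmless. */
static uint64_t model_free_data(struct qgroup_model *q, size_t idx)
{
	assert(q != NULL && idx < MODEL_NR_PAGES);
	if (!q->reserved[idx])
		return 0;
	q->reserved[idx] = false;
	q->rsv_bytes -= MODEL_PAGE_SIZE;
	return MODEL_PAGE_SIZE;
}

/* Invalidation that skips clean pages (old) or always frees (fixed). */
static void model_invalidatepage(struct qgroup_model *q, size_t idx,
				 bool page_dirty, bool unconditional)
{
	if (unconditional || page_dirty)
		model_free_data(q, idx);
}

/* Returns the reserved bytes left behind after the truncate race. */
static uint64_t model_leak_after_truncate(bool unconditional)
{
	struct qgroup_model q = { { false }, 0 };

	model_reserve(&q, 3); /* the buffered write reserves space */
	/* Writeback already cleared the dirty bit when truncate invalidates. */
	model_invalidatepage(&q, 3, false, unconditional);
	model_invalidatepage(&q, 3, false, unconditional); /* repeat is safe */
	return q.rsv_bytes;
}
```

Because the free is keyed on the flag rather than on the page state, calling it for a clean page releases the reservation exactly once and a second call does nothing.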
      
      Fixes: 0b34c261 ("btrfs: qgroup: Prevent qgroup->reserved from going subzero")
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix double free on ulist after backref resolution failure · 580c079b
      Filipe Manana authored
      At btrfs_find_all_roots_safe() we allocate a ulist and set the **roots
      argument to point to it. However if later we fail due to an error returned
      by find_parent_nodes(), we free that ulist but leave a dangling pointer in
      the **roots argument. Upon receiving the error, a caller of this function
      can attempt to free the same ulist again, resulting in an invalid memory
      access.
      
      One such scenario is during qgroup accounting:
      
      btrfs_qgroup_account_extents()
      
       --> calls btrfs_find_all_roots() passes &new_roots (a stack allocated
           pointer) to btrfs_find_all_roots()
      
         --> btrfs_find_all_roots() just calls btrfs_find_all_roots_safe()
             passing &new_roots to it
      
           --> allocates ulist and assigns its address to **roots (which
               points to new_roots from btrfs_qgroup_account_extents())
      
           --> find_parent_nodes() returns an error, so we free the ulist
               and leave **roots pointing to it after returning
      
       --> btrfs_qgroup_account_extents() sees btrfs_find_all_roots() returned
           an error and jumps to the label 'cleanup', which just tries to
           free again the same ulist
      
      Stack trace example:
      
       ------------[ cut here ]------------
       BTRFS: tree first key check failed
       WARNING: CPU: 1 PID: 1763215 at fs/btrfs/disk-io.c:422 btrfs_verify_level_key+0xe0/0x180 [btrfs]
       Modules linked in: dm_snapshot dm_thin_pool (...)
       CPU: 1 PID: 1763215 Comm: fsstress Tainted: G        W         5.8.0-rc3-btrfs-next-64 #1
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
       RIP: 0010:btrfs_verify_level_key+0xe0/0x180 [btrfs]
       Code: 28 5b 5d (...)
       RSP: 0018:ffffb89b473779a0 EFLAGS: 00010286
       RAX: 0000000000000000 RBX: ffff90397759bf08 RCX: 0000000000000000
       RDX: 0000000000000001 RSI: 0000000000000027 RDI: 00000000ffffffff
       RBP: ffff9039a419c000 R08: 0000000000000000 R09: 0000000000000000
       R10: 0000000000000000 R11: ffffb89b43301000 R12: 000000000000005e
       R13: ffffb89b47377a2e R14: ffffb89b473779af R15: 0000000000000000
       FS:  00007fc47e1e1000(0000) GS:ffff9039ac200000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007fc47e1df000 CR3: 00000003d9e4e001 CR4: 00000000003606e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        read_block_for_search+0xf6/0x350 [btrfs]
        btrfs_next_old_leaf+0x242/0x650 [btrfs]
        resolve_indirect_refs+0x7cf/0x9e0 [btrfs]
        find_parent_nodes+0x4ea/0x12c0 [btrfs]
        btrfs_find_all_roots_safe+0xbf/0x130 [btrfs]
        btrfs_qgroup_account_extents+0x9d/0x390 [btrfs]
        btrfs_commit_transaction+0x4f7/0xb20 [btrfs]
        btrfs_sync_file+0x3d4/0x4d0 [btrfs]
        do_fsync+0x38/0x70
        __x64_sys_fdatasync+0x13/0x20
        do_syscall_64+0x5c/0xe0
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
       RIP: 0033:0x7fc47e2d72e3
       Code: Bad RIP value.
       RSP: 002b:00007fffa32098c8 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
       RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fc47e2d72e3
       RDX: 00007fffa3209830 RSI: 00007fffa3209830 RDI: 0000000000000003
       RBP: 000000000000072e R08: 0000000000000001 R09: 0000000000000003
       R10: 0000000000000000 R11: 0000000000000246 R12: 00000000000003e8
       R13: 0000000051eb851f R14: 00007fffa3209970 R15: 00005607c4ac8b50
       irq event stamp: 0
       hardirqs last  enabled at (0): [<0000000000000000>] 0x0
       hardirqs last disabled at (0): [<ffffffffb8eb5e85>] copy_process+0x755/0x1eb0
       softirqs last  enabled at (0): [<ffffffffb8eb5e85>] copy_process+0x755/0x1eb0
       softirqs last disabled at (0): [<0000000000000000>] 0x0
       ---[ end trace 8639237550317b48 ]---
       BTRFS error (device sdc): tree first key mismatch detected, bytenr=62324736 parent_transid=94 key expected=(262,108,1351680) has=(259,108,1921024)
       general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b6b: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
       CPU: 2 PID: 1763215 Comm: fsstress Tainted: G        W         5.8.0-rc3-btrfs-next-64 #1
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
       RIP: 0010:ulist_release+0x14/0x60 [btrfs]
       Code: c7 07 00 (...)
       RSP: 0018:ffffb89b47377d60 EFLAGS: 00010282
       RAX: 6b6b6b6b6b6b6b6b RBX: ffff903959b56b90 RCX: 0000000000000000
       RDX: 0000000000000001 RSI: 0000000000270024 RDI: ffff9036e2adc840
       RBP: ffff9036e2adc848 R08: 0000000000000000 R09: 0000000000000000
       R10: 0000000000000000 R11: 0000000000000000 R12: ffff9036e2adc840
       R13: 0000000000000015 R14: ffff9039a419ccf8 R15: ffff90395d605840
       FS:  00007fc47e1e1000(0000) GS:ffff9039ac600000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007f8c1c0a51c8 CR3: 00000003d9e4e004 CR4: 00000000003606e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        ulist_free+0x13/0x20 [btrfs]
        btrfs_qgroup_account_extents+0xf3/0x390 [btrfs]
        btrfs_commit_transaction+0x4f7/0xb20 [btrfs]
        btrfs_sync_file+0x3d4/0x4d0 [btrfs]
        do_fsync+0x38/0x70
        __x64_sys_fdatasync+0x13/0x20
        do_syscall_64+0x5c/0xe0
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
       RIP: 0033:0x7fc47e2d72e3
       Code: Bad RIP value.
       RSP: 002b:00007fffa32098c8 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
       RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fc47e2d72e3
       RDX: 00007fffa3209830 RSI: 00007fffa3209830 RDI: 0000000000000003
       RBP: 000000000000072e R08: 0000000000000001 R09: 0000000000000003
       R10: 0000000000000000 R11: 0000000000000246 R12: 00000000000003e8
       R13: 0000000051eb851f R14: 00007fffa3209970 R15: 00005607c4ac8b50
       Modules linked in: dm_snapshot dm_thin_pool (...)
       ---[ end trace 8639237550317b49 ]---
       RIP: 0010:ulist_release+0x14/0x60 [btrfs]
       Code: c7 07 00 (...)
       RSP: 0018:ffffb89b47377d60 EFLAGS: 00010282
       RAX: 6b6b6b6b6b6b6b6b RBX: ffff903959b56b90 RCX: 0000000000000000
       RDX: 0000000000000001 RSI: 0000000000270024 RDI: ffff9036e2adc840
       RBP: ffff9036e2adc848 R08: 0000000000000000 R09: 0000000000000000
       R10: 0000000000000000 R11: 0000000000000000 R12: ffff9036e2adc840
       R13: 0000000000000015 R14: ffff9039a419ccf8 R15: ffff90395d605840
       FS:  00007fc47e1e1000(0000) GS:ffff9039ad200000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007f6a776f7d40 CR3: 00000003d9e4e002 CR4: 00000000003606e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      
      Fix this by making btrfs_find_all_roots_safe() set *roots to NULL after
      it frees the ulist.
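
The pattern behind this fix, clearing an out parameter after freeing what it points to, can be sketched in user space as follows. `struct ulist_model` and the `model_*` functions are hypothetical stand-ins, and the `fail` argument merely simulates an error from find_parent_nodes().

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct ulist. */
struct ulist_model {
	int nnodes;
};

/* free() accepts NULL, so freeing a cleared pointer is a no-op. */
static void model_ulist_free(struct ulist_model *ulist)
{
	free(ulist);
}

/*
 * Allocates *roots, then fails when 'fail' is non-zero. On failure the list
 * is freed and *roots is reset, so the caller is not left with a dangling
 * pointer.
 */
static int model_find_all_roots(struct ulist_model **roots, int fail)
{
	assert(roots != NULL);
	*roots = calloc(1, sizeof(**roots));
	if (*roots == NULL)
		return -1;
	if (fail) {
		model_ulist_free(*roots);
		*roots = NULL; /* the one-line fix */
		return -1;
	}
	return 0;
}

/* Caller with a shared cleanup path, as in the qgroup accounting example. */
static int model_account_extents(int fail)
{
	struct ulist_model *new_roots = NULL;
	int ret = model_find_all_roots(&new_roots, fail);

	/* cleanup: runs on success and on error alike */
	model_ulist_free(new_roots);
	return ret;
}
```

Without the reset, the cleanup in `model_account_extents()` would free the same allocation a second time on the error path.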
      
      Fixes: 8da6d581 ("Btrfs: added btrfs_find_all_roots()")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • exfat: fix name_hash computation on big endian systems · db415f7a
      Ilya Ponetayev authored
      The on-disk format of the name_hash field is little-endian, so it must
      be converted explicitly on big-endian systems to get the correct result.
      
      Fixes: 370e812b ("exfat: add nls operations")
      Cc: stable@vger.kernel.org # v5.7
      Signed-off-by: Chen Minqiang <ptpt52@gmail.com>
      Signed-off-by: Ilya Ponetayev <i.ponetaev@ndmsystems.com>
      Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
      Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
    • exfat: fix wrong size update of stream entry by typo · 41e3928f
      Hyeongseok Kim authored
      The stream.size field is updated to the value of create timestamp
      of the file entry. Fix this to use correct stream entry pointer.
      
      Fixes: 29bbb14b ("exfat: fix incorrect update of stream entry in __exfat_truncate()")
      Signed-off-by: Hyeongseok Kim <hyeongseok@gmail.com>
      Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
    • exfat: fix wrong hint_stat initialization in exfat_find_dir_entry() · d2fa0c33
      Namjae Jeon authored
      We found the wrong hint_stat initialization in exfat_find_dir_entry().
      It should be initialized when cluster is EXFAT_EOF_CLUSTER.
      
      Fixes: ca061973 ("exfat: add directory operations")
      Cc: stable@vger.kernel.org # v5.7
      Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
      Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
    • exfat: fix overflow issue in exfat_cluster_to_sector() · 43946b70
      Namjae Jeon authored
      An overflow issue can occur while calculating sector in
      exfat_cluster_to_sector(). It needs to cast clus's type to sector_t
      before left shifting.
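
The difference between shifting before and after widening can be sketched as below. This is a simplified model: it assumes a 32-bit cluster number and a 64-bit sector type, the function and parameter names are hypothetical, and the adjustment for reserved clusters is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Shift done in 32 bits, then widened: the high bits are already lost. */
static uint64_t model_sector_narrow(uint32_t clus, unsigned int bits,
				    uint64_t data_start)
{
	assert(bits < 32);
	return (uint64_t)(uint32_t)(clus << bits) + data_start;
}

/* Widened first, then shifted, so no bits fall off the top. */
static uint64_t model_sector_wide(uint32_t clus, unsigned int bits,
				  uint64_t data_start)
{
	assert(bits < 32);
	return ((uint64_t)clus << bits) + data_start;
}
```

For small cluster numbers both variants agree; they diverge once the shifted value no longer fits in 32 bits, for example with cluster 0x10000000 and 6 sector bits per cluster.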
      
      Fixes: 1acf1a56 ("exfat: add in-memory and on-disk structures and headers")
      Cc: stable@vger.kernel.org # v5.7
      Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
      Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
  5. 20 Jul, 2020 2 commits
  6. 18 Jul, 2020 2 commits
    • io_uring: always allow drain/link/hardlink/async sqe flags · 61710e43
      Daniele Albano authored
      We currently filter these for timeout_remove/async_cancel/files_update,
      but we should only be filtering for fixed file and buffer select. This
      also causes a second read of sqe->flags, which isn't needed.
      
      Just check req->flags for the relevant bits. This then allows these
      commands to be used in links, for example, like everything else.
      Signed-off-by: Daniele Albano <d.albano@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: ensure double poll additions work with both request types · 807abcb0
      Jens Axboe authored
      The double poll additions were centered around doing POLL_ADD on file
      descriptors that use more than one waitqueue (typically one for read,
      one for write) when being polled. However, it can also end up being
      triggered when we use poll-triggered retry. For that case, we cannot
      safely use req->io, as that could be used by the request type itself.
      
      Add a second io_poll_iocb pointer in the structure we allocate for poll
      based retry, and ensure we use the right one from the two paths.
      
      Fixes: 18bceab1 ("io_uring: allow POLL_ADD with double poll_wait() users")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  7. 17 Jul, 2020 1 commit
  8. 16 Jul, 2020 2 commits
  9. 15 Jul, 2020 11 commits
    • afs: Fix interruption of operations · 811f04ba
      David Howells authored
      The afs filesystem driver allows unstarted operations to be cancelled by
      signal, but most of these can easily be restarted (mkdir for example).  The
      primary culprits for reproducing this are those applications that use
      SIGALRM to display a progress counter.
      
      File lock-extension operation is marked uninterruptible as we have a
      limited time in which to do it, and the release op is marked
      uninterruptible also as if we fail to unlock a file, we'll have to wait 20
      mins before anyone can lock it again.
      
      The store operation logs a warning if it gets interrupted, e.g.:
      
      	kAFS: Unexpected error from FS.StoreData -4
      
      because it's run from the background - but it can also be run from
      fdatasync()-type things.  However, store operations aren't marked
      uninterruptible at the moment.
      
      Fix this in the following ways:
      
       (1) Mark store operations as uninterruptible.  It might make sense to
           relax this for certain situations, but I'm not sure how to make sure
           that background store ops aren't affected by signals to foreground
           processes that happen to trigger them.
      
       (2) In afs_get_io_locks(), where we're getting the serialisation lock for
           talking to the fileserver, return ERESTARTSYS rather than EINTR
           because a lot of the operations (e.g. mkdir) are restartable if we
           haven't yet started sending the op to the server.
      
      Fixes: e49c7b2f ("afs: Build an abstraction around an "operation" concept")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ovl: fix mount option checks for nfs_export with no upperdir · f0e1266e
      Amir Goldstein authored
      Without upperdir mount option, there is no index dir and the dependency
      checks nfs_export => index for mount options parsing are incorrect.
      
      Allow the combination nfs_export=on,index=off with no upperdir and move
      the check for dependency redirect_dir=nofollow for non-upper mount case
      to mount options parsing.
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • ovl: force read-only sb on failure to create index dir · 470c1563
      Amir Goldstein authored
      With index feature enabled, on failure to create index dir, overlay is
      being mounted read-only.  However, we do not forbid user to remount overlay
      read-write.  Fix that by setting ofs->workdir to NULL, which prevents
      remount read-write.
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • ovl: fix regression with re-formatted lower squashfs · a888db31
      Amir Goldstein authored
      Commit 9df085f3 ("ovl: relax requirement for non null uuid of lower
      fs") relaxed the requirement for non null uuid with single lower layer to
      allow enabling index and nfs_export features with single lower squashfs.
      
      Fabian reported a regression in a setup when overlay re-uses an existing
      upper layer and re-formats the lower squashfs image.  Because squashfs
      has no uuid, the origin xattr in upper layer are decoded from the new
      lower layer where they may resolve to a wrong origin file and user may
      get an ESTALE or EIO error on lookup.
      
      To avoid the reported regression while still allowing the new features
      with single lower squashfs, do not allow decoding origin with lower null
      uuid unless user opted-in to one of the new features that require
      following the lower inode of non-dir upper (index, xino, metacopy).
      Reported-by: Fabian <godi.beat@gmx.net>
      Link: https://lore.kernel.org/linux-unionfs/32532923.JtPX5UtSzP@fgdesktop/
      Fixes: 9df085f3 ("ovl: relax requirement for non null uuid of lower fs")
      Cc: stable@vger.kernel.org # v4.20+
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • ovl: fix oops in ovl_indexdir_cleanup() with nfs_export=on · 20396365
      Amir Goldstein authored
      Mounting with nfs_export=on, xfstests overlay/031 triggers a kernel panic
      since v5.8-rc1 overlayfs updates.
      
       overlayfs: orphan index entry (index/00fb1..., ftype=4000, nlink=2)
       BUG: kernel NULL pointer dereference, address: 0000000000000030
       RIP: 0010:ovl_cleanup_and_whiteout+0x28/0x220 [overlay]
      
      Bisection points at commit c21c839b ("ovl: whiteout inode sharing").
      
      Minimal reproducer:
      --------------------------------------------------
      rm -rf l u w m
      mkdir -p l u w m
      mkdir -p l/testdir
      touch l/testdir/testfile
      mount -t overlay -o lowerdir=l,upperdir=u,workdir=w,nfs_export=on overlay m
      echo 1 > m/testdir/testfile
      umount m
      rm -rf u/testdir
      mount -t overlay -o lowerdir=l,upperdir=u,workdir=w,nfs_export=on overlay m
      umount m
      --------------------------------------------------
      
      When mounting with nfs_export=on and failing to verify an orphan index,
      we clean this index from indexdir by calling ovl_cleanup_and_whiteout().
      This dereferences ofs->workdir, which was earlier set to NULL.
      
      The design was that ofs->workdir will point at ofs->indexdir, but we are
      assigning ofs->indexdir to ofs->workdir only after ovl_indexdir_cleanup().
      There is no reason not to do it sooner, because once we get success from
      ofs->indexdir = ovl_workdir_create(... there is no turning back.
      Reported-and-tested-by: Murphy Zhou <jencce.kernel@gmail.com>
      Fixes: c21c839b ("ovl: whiteout inode sharing")
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • ovl: relax WARN_ON() when decoding lower directory file handle · 124c2de2
      Amir Goldstein authored
      Decoding a lower directory file handle to overlay path with cold
      inode/dentry cache may go as follows:
      
      1. Decode real lower file handle to lower dir path
      2. Check if lower dir is indexed (was copied up)
      3. If indexed, get the upper dir path from index
      4. Lookup upper dir path in overlay
      5. If overlay path found, verify that overlay lower is the lower dir
         from step 1
      
      On failure to verify step 5 above, user will get an ESTALE error and a
      WARN_ON will be printed.
      
      A mismatch in step 5 could be a result of lower directory that was renamed
      while overlay was offline, after that lower directory has been copied up
      and indexed.
      
      This is a scripted reproducer based on xfstest overlay/052:
      
        # Create lower subdir
        create_dirs
        create_test_files $lower/lowertestdir/subdir
        mount_dirs
        # Copy up lower dir and encode lower subdir file handle
        touch $SCRATCH_MNT/lowertestdir
        test_file_handles $SCRATCH_MNT/lowertestdir/subdir -p -o $tmp.fhandle
        # Rename lower dir offline
        unmount_dirs
        mv $lower/lowertestdir $lower/lowertestdir.new/
        mount_dirs
        # Attempt to decode lower subdir file handle
        test_file_handles $SCRATCH_MNT -p -i $tmp.fhandle
      
      Since this WARN_ON() can be triggered by user we need to relax it.
      
      Fixes: 4b91c30a ("ovl: lookup connected ancestor of dir in inode cache")
      Cc: <stable@vger.kernel.org> # v4.16+
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • ovl: remove not used argument in ovl_check_origin · d78a0dcf
      youngjun authored
      The 'ctrp' out parameter of ovl_check_origin() is not used by any
      caller, so remove it.
      Signed-off-by: youngjun <her0gyugyu@gmail.com>
      Reviewed-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • ovl: change ovl_copy_up_flags static · 5ac8e802
      youngjun authored
      ovl_copy_up_flags() is only used in copy_up.c, so make it static.
      Signed-off-by: youngjun <her0gyugyu@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • ovl: inode reference leak in ovl_is_inuse true case. · 24f14009
      youngjun authored
      In the case where ovl_is_inuse() returns true, the trap inode reference
      is not put.  Put it, and add a comment explaining the ordering of
      ovl_is_inuse() after ovl_setup_trap().
      
      Fixes: 0be0bfd2 ("ovl: fix regression caused by overlapping layers detection")
      Cc: <stable@vger.kernel.org> # v4.19+
      Reviewed-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: youngjun <her0gyugyu@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • io_uring: fix recvmsg memory leak with buffer selection · 681fda8d
      Pavel Begunkov authored
      io_recvmsg() doesn't free memory allocated for struct io_buffer. This can
      cause a leak when used with automatic buffer selection.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • fuse: Fix parameter for FS_IOC_{GET,SET}FLAGS · 31070f6c
      Chirantan Ekbote authored
      The ioctl encoding for this parameter is a long but the documentation says
      it should be an int and the kernel drivers expect it to be an int.  If the
      fuse driver treats this as a long it might end up scribbling over the stack
      of a userspace process that only allocated enough space for an int.
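
The stack scribbling risk can be demonstrated safely with a guard-filled buffer. This is a user-space sketch with hypothetical names (`model_copy_out()`, `model_clobbered_bytes()`); a plain memcpy() stands in for the copy out to user space, and no real ioctl is issued.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_GUARD 0xAA

/*
 * Copies 'len' bytes of a flags value into the caller's buffer. 'len' is
 * what the driver believes the size of the argument to be.
 */
static void model_copy_out(unsigned char *user_buf, const void *value,
			   size_t len)
{
	assert(user_buf != NULL && value != NULL);
	memcpy(user_buf, value, len);
}

/*
 * The caller owns only sizeof(int) bytes at the start of a guard-filled
 * region. Returns how many guard bytes behind the int were overwritten.
 */
static size_t model_clobbered_bytes(size_t driver_len)
{
	unsigned char region[sizeof(int) + sizeof(long)];
	long flags = 0;
	size_t clobbered = 0;

	assert(driver_len <= sizeof(flags));
	memset(region, MODEL_GUARD, sizeof(region));
	model_copy_out(region, &flags, driver_len);
	for (size_t i = sizeof(int); i < sizeof(region); i++)
		if (region[i] != MODEL_GUARD)
			clobbered++;
	return clobbered;
}
```

On platforms where long is wider than int, a driver that copies sizeof(long) bytes overwrites memory just past the caller's int; copying sizeof(int) bytes leaves the guard bytes intact.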
      
      This was previously discussed in [1] and a patch for fuse was proposed in
      [2].  From what I can tell the patch in [2] was nacked in favor of adding
      new, "fixed" ioctls and using those from userspace.  However there is still
      no "fixed" version of these ioctls and the fact is that it's sometimes
      infeasible to change all userspace to use the new one.
      
      Handling the ioctls specially in the fuse driver seems like the most
      pragmatic way for fuse servers to support them without causing crashes in
      userspace applications that call them.
      
      [1]: https://lore.kernel.org/linux-fsdevel/20131126200559.GH20559@hall.aurel32.net/T/
      [2]: https://sourceforge.net/p/fuse/mailman/message/31771759/
      
      Signed-off-by: Chirantan Ekbote <chirantan@chromium.org>
      Fixes: 59efec7b ("fuse: implement ioctl support")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
  10. 14 Jul, 2020 7 commits
    • fuse: don't ignore errors from fuse_writepages_fill() · 7779b047
      Vasily Averin authored
      fuse_writepages() ignores some errors returned by fuse_writepages_fill().
      I believe this is a bug: if .writepages is called with WB_SYNC_ALL it
      should either guarantee that all data was successfully saved or return
      an error.
      
      Fixes: 26d614df ("fuse: Implement writepages callback")
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
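A minimal sketch of the error-propagation rule stated above, assuming a simplified page walk. None of these names are kernel APIs: `walk_pages()`, `demo_fill()` and `writepages_sync()` are hypothetical stand-ins for the page iterator, the fill callback and the sync-mode writeback entry point.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-page callback: returns 0 or a negative errno. */
typedef int (*fill_fn)(int page, void *data);

/* Walk pages [0, npages) and stop at the first error, returning it. */
static int walk_pages(int npages, fill_fn fill, void *data)
{
	for (int page = 0; page < npages; page++) {
		int err = fill(page, data);

		if (err)
			return err;
	}
	return 0;
}

struct fill_state {
	int fail_at;	/* page index that fails, or -1 for none */
	int filled;	/* pages handled successfully so far */
};

/* Demo callback that fails with -ENOMEM on one chosen page. */
static int demo_fill(int page, void *data)
{
	struct fill_state *st = data;

	if (page == st->fail_at)
		return -ENOMEM;
	st->filled++;
	return 0;
}

/* Sync-style writeback: either every page was handled or the caller
 * sees the error; the result of the walk is never dropped. */
static int writepages_sync(int npages, int fail_at)
{
	struct fill_state st = { .fail_at = fail_at, .filled = 0 };

	return walk_pages(npages, demo_fill, &st);
}
```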
    • fuse: clean up condition for writepage sending · 6ddf3af9
      Miklos Szeredi authored
      fuse_writepages_fill() uses the following construction:
      
      if (wpa && ap->num_pages &&
          (A || B || C)) {
              action;
      } else if (wpa && D) {
              if (E) {
                      the same action;
              }
      }
      
       - the ap->num_pages check is always true and can be removed

       - "if" and "else if" call the same action and can be merged.

      Move the checks for conditions A, B, C, D and E to a helper and add
      comments.

      Original-patch-by: Vasily Averin <vvs@virtuozzo.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
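The refactoring described in this commit can be checked with a small truth-table sketch. The booleans A..E are the placeholders from the commit message; `send_original()`, `need_send()`, `send_merged()` and `count_mismatches()` are hypothetical names used only for this illustration.

```c
#include <stdbool.h>

/* The original shape: two branches that trigger the same action. */
static bool send_original(bool wpa, bool num_pages,
			  bool a, bool b, bool c, bool d, bool e)
{
	if (wpa && num_pages && (a || b || c)) {
		return true;		/* action */
	} else if (wpa && d) {
		if (e)
			return true;	/* the same action */
	}
	return false;
}

/* Helper that gathers conditions A..E in one place. */
static bool need_send(bool a, bool b, bool c, bool d, bool e)
{
	return a || b || c || (d && e);
}

/* Merged form: a single branch guarded by the helper. */
static bool send_merged(bool wpa, bool a, bool b, bool c, bool d, bool e)
{
	return wpa && need_send(a, b, c, d, e);
}

/* Compare both forms over every input, with num_pages held true as the
 * commit message states it always is. */
static int count_mismatches(void)
{
	int mismatches = 0;

	for (unsigned int bits = 0; bits < 64; bits++) {
		bool wpa = bits & 1, a = bits & 2, b = bits & 4;
		bool c = bits & 8, d = bits & 16, e = bits & 32;

		if (send_original(wpa, true, a, b, c, d, e) !=
		    send_merged(wpa, a, b, c, d, e))
			mismatches++;
	}
	return mismatches;
}
```

The merge is only safe because the num_pages check is always true; with num_pages false the original could still reach the action through the "else if" branch.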
    • fuse: reject options on reconfigure via fsconfig(2) · b330966f
      Miklos Szeredi authored
      The previous patch changed the handling of remount/reconfigure to ignore
      all options, including those that are unknown to the fuse kernel fs.  This
      was done for backward compatibility, but the compatibility concern likely
      only affects the old mount(2) API.

      The new fsconfig(2) based reconfiguration could possibly be improved.  This
      would make the new API less of a drop-in replacement for the old; OTOH this
      is a good chance to get rid of some weirdnesses in the old API.
      
      Several other behaviors might make sense:
      
       1) unknown options are rejected, known options are ignored
      
       2) unknown options are rejected, known options are rejected if the value
       is changed, allowed otherwise
      
       3) all options are rejected
      
      Prior to the backward compatibility fix to ignore all options, all known
      options were accepted (1), even if they changed the value of a mount
      parameter; fuse_reconfigure() does not look at the config values set by
      fuse_parse_param().
      
      To fix that we'd need to verify that the value provided is the same as set
      in the initial configuration (2).  The major drawback is that this is much
      more complex than just rejecting all attempts at changing options (3);
      i.e. all options signify initial configuration values and don't make sense
      on reconfigure.
      
      This patch opts for (3) with the rationale that no mount options are
      reconfigurable in fuse.

      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
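A minimal userspace sketch of the resulting policy, assuming a simplified context structure. `struct ctx`, `enum ctx_purpose`, `option_is_known()`, `parse_param()` and `try_option()` are hypothetical and only loosely modelled on the kernel's mount context; the option list is illustrative, not the full set fuse accepts.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the mount context purpose and API flavour. */
enum ctx_purpose { CTX_FOR_MOUNT, CTX_FOR_RECONFIGURE };

struct ctx {
	enum ctx_purpose purpose;
	bool oldapi;	/* true when the request came through mount(2) */
};

/* Illustrative subset of options accepted at initial mount time. */
static bool option_is_known(const char *key)
{
	static const char *const known[] = {
		"fd", "rootmode", "user_id", "group_id",
	};

	for (size_t i = 0; i < sizeof(known) / sizeof(known[0]); i++) {
		if (strcmp(key, known[i]) == 0)
			return true;
	}
	return false;
}

static int parse_param(const struct ctx *c, const char *key)
{
	if (c->purpose == CTX_FOR_RECONFIGURE) {
		if (c->oldapi)
			return 0;	/* mount(2) remount: ignore the data */
		return -EINVAL;		/* fsconfig(2): behavior (3), reject all */
	}
	return option_is_known(key) ? 0 : -EINVAL;
}

/* Convenience wrapper so the behaviors can be exercised in one call. */
static int try_option(enum ctx_purpose purpose, bool oldapi, const char *key)
{
	struct ctx c = { .purpose = purpose, .oldapi = oldapi };

	return parse_param(&c, key);
}
```

Rejecting everything on the new API keeps the check to a single branch, which is the simplicity argument the commit makes for (3) over (2).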
    • fuse: ignore 'data' argument of mount(..., MS_REMOUNT) · e8b20a47
      Miklos Szeredi authored
      The command
      
        mount -o remount -o unknownoption /mnt/fuse
      
      succeeds on kernel versions prior to v5.4 and fails on v5.4 and later.
      This is because fuse_parse_param() rejects any unrecognised options in
      case of FS_CONTEXT_FOR_RECONFIGURE, just as for FS_CONTEXT_FOR_MOUNT.
      
      This causes a regression in case the fuse filesystem is in fstab, since
      remount sends all options found there to the kernel; even ones that are
      meant for the initial mount and are consumed by the userspace fuse server.
      
      Fix this by ignoring mount options, just as fuse_remount_fs() did prior to
      the conversion to the new API.

      Reported-by: Stefan Priebe <s.priebe@profihost.ag>
      Fixes: c30da2e9 ("fuse: convert to use the new mount API")
      Cc: <stable@vger.kernel.org> # v5.4
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • fuse: use ->reconfigure() instead of ->remount_fs() · 0189a2d3
      Miklos Szeredi authored
      s_op->remount_fs() is only called from legacy_reconfigure(), which is no
      longer used once the filesystem has been converted to the new API.
      
      Convert to using ->reconfigure().  This restores the previous behavior of
      syncing the filesystem and rejecting MS_MANDLOCK on remount.
      
      Fixes: c30da2e9 ("fuse: convert to use the new mount API")
      Cc: <stable@vger.kernel.org> # v5.4
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • fuse: fix warning in tree_insert() and clean up writepage insertion · c146024e
      Miklos Szeredi authored
      fuse_writepages_fill() calls tree_insert() with ap->num_pages = 0 which
      triggers the following warning:
      
       WARNING: CPU: 1 PID: 17211 at fs/fuse/file.c:1728 tree_insert+0xab/0xc0 [fuse]
       RIP: 0010:tree_insert+0xab/0xc0 [fuse]
       Call Trace:
        fuse_writepages_fill+0x5da/0x6a0 [fuse]
        write_cache_pages+0x171/0x470
        fuse_writepages+0x8a/0x100 [fuse]
        do_writepages+0x43/0xe0
      
      Fix up the warning and clean up the code around rb-tree insertion:
      
       - Rename tree_insert() to fuse_insert_writeback() and make it return the
         conflicting entry in case of failure
      
       - Re-add tree_insert() as a wrapper around fuse_insert_writeback()
      
       - Rename fuse_writepage_in_flight() to fuse_writepage_add() and reverse
         the meaning of the return value to mean
      
          + "true" in case the writepage entry was successfully added
      
          + "false" in case it was in flight: either queued on an existing
             writepage entry's auxiliary list, or the existing writepage
             entry's temporary page was updated
      
         Switch from fuse_find_writeback() + tree_insert() to
         fuse_insert_writeback()
      
       - Move setting orig_pages to before inserting/updating the entry; this may
         result in the orig_pages value being discarded later in case of an
         in-flight request
      
       - In case of a new writepage entry use fuse_writepage_add()
         unconditionally, only set data->wpa if the entry was added.
      
      Fixes: 6b2fb799 ("fuse: optimize writepages search")
      Reported-by: kernel test robot <rong.a.chen@intel.com>
      Original-patch-by: Vasily Averin <vvs@virtuozzo.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
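The "return the conflicting entry on failure" idea can be sketched outside the kernel. This is not the fuse implementation: it swaps the kernel rb-tree for a plain unbalanced binary search tree, and `struct wb_entry`, `insert_writeback()` and `overlap_demo()` are hypothetical names chosen for the illustration.

```c
#include <stddef.h>

/* Hypothetical writeback entry covering the page range [start, end]. */
struct wb_entry {
	unsigned long start, end;
	struct wb_entry *left, *right;
};

/* Insert `new` into a tree of non-overlapping ranges.  Returns NULL on
 * success, or the existing entry that overlaps `new` on failure, so the
 * caller does not need a separate lookup before inserting. */
static struct wb_entry *insert_writeback(struct wb_entry **root,
					 struct wb_entry *new)
{
	struct wb_entry **p = root;

	while (*p) {
		struct wb_entry *cur = *p;

		if (new->end < cur->start)
			p = &cur->left;
		else if (new->start > cur->end)
			p = &cur->right;
		else
			return cur;	/* ranges overlap: report the conflict */
	}
	new->left = new->right = NULL;
	*p = new;
	return NULL;
}

/* Returns 1 if two disjoint inserts succeed and a third, overlapping
 * insert reports the entry it collides with. */
static int overlap_demo(void)
{
	struct wb_entry *root = NULL;
	struct wb_entry a = { .start = 0, .end = 3 };
	struct wb_entry b = { .start = 8, .end = 9 };
	struct wb_entry c = { .start = 2, .end = 5 };

	if (insert_writeback(&root, &a) != NULL)
		return 0;
	if (insert_writeback(&root, &b) != NULL)
		return 0;
	return insert_writeback(&root, &c) == &a;
}
```

Returning the conflicting entry lets a caller queue the new request on that entry directly, which is what replaces the separate find-then-insert sequence named in the commit.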
    • fuse: move rb_erase() before tree_insert() · 69a6487a
      Miklos Szeredi authored
      In fuse_writepage_end() the old writepages entry needs to be removed from
      the rbtree before inserting the new one, otherwise tree_insert() would
      fail.  This is a very rare codepath and no reproducer exists.

      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
  11. 13 Jul, 2020 1 commit
    • NFS: Fix interrupted slots by sending a solo SEQUENCE operation · 913fadc5
      Anna Schumaker authored
      We used to do this before 3453d570, but this was changed to better
      handle the NFS4ERR_SEQ_MISORDERED error code. That commit fixed the slot
      re-use case when the server doesn't receive the interrupted operation,
      but if the server does receive the operation then it could still end up
      replying to the client with mismatched operations from the reply cache.
      
      We can fix this by sending a SEQUENCE to the server while recovering from
      a SEQ_MISORDERED error when we detect that we are in an interrupted slot
      situation.
      
      Fixes: 3453d570 ("NFSv4.1: Avoid false retries when RPC calls are interrupted")
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  12. 12 Jul, 2020 2 commits