Skip to content
Snippets Groups Projects
  1. Aug 09, 2019
    • Andreas Gruenbacher's avatar
      gfs2: gfs2_walk_metadata fix · a27a0c9b
      Andreas Gruenbacher authored
      
      It turns out that the current version of gfs2_metadata_walker suffers
      from multiple problems that can cause gfs2_hole_size to report an
      incorrect size.  This will confuse fiemap as well as lseek with the
      SEEK_DATA flag.
      
      Fix that by changing gfs2_hole_walker to compute the metapath to the
      first data block after the hole (if any), and compute the hole size
      based on that.
      
      Fixes xfstest generic/490.
      
      Signed-off-by: default avatarAndreas Gruenbacher <agruenba@redhat.com>
      Reviewed-by: default avatarBob Peterson <rpeterso@redhat.com>
      Cc: stable@vger.kernel.org # v4.18+
      a27a0c9b
  2. Aug 08, 2019
    • Jan Kara's avatar
      bdev: Fixup error handling in blkdev_get() · e91455ba
      Jan Kara authored
      
      Commit 89e524c0 ("loop: Fix mount(2) failure due to race with
      LOOP_SET_FD") converted blkdev_get() to use the new helpers for
      finishing claiming of a block device. However the conversion botched the
      error handling in blkdev_get() and thus the bdev has been marked as held
      even in case __blkdev_get() returned error. This led to occasional
      warnings with block/001 test from blktests like:
      
      kernel: WARNING: CPU: 5 PID: 907 at fs/block_dev.c:1899 __blkdev_put+0x396/0x3a0
      
      Correct the error handling.
      
      CC: stable@vger.kernel.org
      Fixes: 89e524c0 ("loop: Fix mount(2) failure due to race with LOOP_SET_FD")
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      e91455ba
    • Greg Kroah-Hartman's avatar
      Revert "kernfs: fix memleak in kernel_ops_readdir()" · 8097c43b
      Greg Kroah-Hartman authored
      
      This reverts commit cc798c83.
      
      Tony writes:
      	Somehow this causes a regression in Linux next for me where I'm
      	seeing lots of sysfs entries now missing under
      	/sys/bus/platform/devices.
      
      	For example, I now only see one .serial entry show up in sysfs.
      	Things work again if I revert commit cc798c83 ("kernfs: fix
      	memleak inkernel_ops_readdir()"). Any ideas why that would be?
      
      Tejun says:
      	Ugh, you're right.  It can get double-put cuz ctx->pos is put by
      	release too.
      
      So reverting it for now.
      
      Reported-by: default avatarTony Lindgren <tony@atomide.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Fixes: cc798c83 ("kernfs: fix memleak in kernel_ops_readdir()")
      Cc: stable <stable@vger.kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8097c43b
  3. Aug 07, 2019
    • Jens Axboe's avatar
      block: fix O_DIRECT error handling for bio fragments · e15c2ffa
      Jens Axboe authored
      
      0eb6ddfb tried to fix this up, but introduced a use-after-free
      of dio. Additionally, we still had an issue with error handling,
      as reported by Darrick:
      
      "I noticed a regression in xfs/747 (an unreleased xfstest for the
      xfs_scrub media scanning feature) on 5.3-rc3.  I'll condense that down
      to a simpler reproducer:
      
      error-test: 0 209 linear 8:48 0
      error-test: 209 1 error
      error-test: 210 6446894 linear 8:48 210
      
      Basically we have a ~3G /dev/sdd and we set up device mapper to fail IO
      for sector 209 and to pass the io to the scsi device everywhere else.
      
      On 5.3-rc3, performing a directio pread of this range with a < 1M buffer
      (in other words, a request for fewer than MAX_BIO_PAGES bytes) yields
      EIO like you'd expect:
      
      pread64(3, 0x7f880e1c7000, 1048576, 0)  = -1 EIO (Input/output error)
      pread: Input/output error
      +++ exited with 0 +++
      
      But doing it with a larger buffer succeeds(!):
      
      pread64(3, "XFSB\0\0\20\0\0\0\0\0\0\fL\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1146880, 0) = 1146880
      read 1146880/1146880 bytes at offset 0
      1 MiB, 1 ops; 0.0009 sec (1.124 GiB/sec and 1052.6316 ops/sec)
      +++ exited with 0 +++
      
      (Note that the part of the buffer corresponding to the dm-error area is
      uninitialized)
      
      On 5.3-rc2, both commands would fail with EIO like you'd expect.  The
      only change between rc2 and rc3 is commit 0eb6ddfb ("block: Fix
      __blkdev_direct_IO() for bio fragments").
      
      AFAICT we end up in __blkdev_direct_IO with a 1120K buffer, which gets
      split into two bios: one for the first BIO_MAX_PAGES worth of data (1MB)
      and a second one for the 96k after that."
      
      Fix this by noting that it's always safe to dereference dio if we get
      BLK_QC_T_EAGAIN returned, as end_io hasn't been run for that case. So
      we can safely increment the dio size before calling submit_bio(), and
      then decrement it on failure (not that it really matters, as the bio
      and dio are going away).
      
      For error handling, return to the original method of just using 'ret'
      for tracking the error, and the size tracking in dio->size.
      
      Fixes: 0eb6ddfb ("block: Fix __blkdev_direct_IO() for bio fragments")
      Fixes: 6a43074e ("block: properly handle IOCB_NOWAIT for async O_DIRECT IO")
      Reported-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      e15c2ffa
    • Trond Myklebust's avatar
      NFSv4: Ensure state recovery handles ETIMEDOUT correctly · 67e7b52d
      Trond Myklebust authored
      
      Ensure that the state recovery code handles ETIMEDOUT correctly,
      and also that we set RPC_TASK_TIMEOUT when recovering open state.
      
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@hammerspace.com>
      67e7b52d
  4. Aug 06, 2019
  5. Aug 05, 2019
  6. Aug 03, 2019
    • Paul Wise's avatar
      coredump: split pipe command whitespace before expanding template · 315c6926
      Paul Wise authored
      Save the offsets of the start of each argument to avoid having to update
      pointers to each argument after every corename krealloc and to avoid
      having to duplicate the memory for the dump command.
      
      Executable names containing spaces were previously being expanded from
      %e or %E and then split in the middle of the filename.  This is
      incorrect behaviour since an argument list can represent arguments with
      spaces.
      
      The splitting could lead to extra arguments being passed to the core
      dump handler that it might have interpreted as options or ignored
      completely.
      
      Core dump handlers that are not aware of this Linux kernel issue will be
      using %e or %E without considering that it may be split and so they will
      be vulnerable to processes with spaces in their names breaking their
      argument list.  If their internals are otherwise well written, such as
      if they are written in shell but quote arguments, they will work better
      after this change than before.  If they are not well written, then there
      is a slight chance of breakage depending on the details of the code but
      they will already be fairly broken by the split filenames.
      
      Core dump handlers that are aware of this Linux kernel issue will be
      placing %e or %E as the last item in their core_pattern and then
      aggregating all of the remaining arguments into one, separated by
      spaces.  Alternatively they will be obtaining the filename via other
      methods.  Both of these will be compatible with the new arrangement.
      
      A side effect from this change is that unknown template types (for
      example %z) result in an empty argument to the dump handler instead of
      the argument being dropped.  This is a desired change as:
      
      It is easier for dump handlers to process empty arguments than dropped
      ones, especially if they are written in shell or don't pass each
      template item with a preceding command-line option in order to
      differentiate between individual template types.  Most core_patterns in
      the wild do not use options so they can confuse different template types
      (especially numeric ones) if an earlier one gets dropped in old kernels.
      If the kernel introduces a new template type and a core_pattern uses it,
      the core dump handler might not expect that the argument can be dropped
      in old kernels.
      
      For example, this can result in security issues when %d is dropped in
      old kernels.  This happened with the corekeeper package in Debian and
      resulted in the interface between corekeeper and Linux having to be
      rewritten to use command-line options to differentiate between template
      types.
      
      The core_pattern for most core dump handlers is written by the handler
      author who would generally not insert unknown template types so this
      change should be compatible with all the core dump handlers that exist.
      
      Link: http://lkml.kernel.org/r/20190528051142.24939-1-pabs3@bonedaddy.net
      
      
      Fixes: 74aadce9 ("core_pattern: allow passing of arguments to user mode helper when core_pattern is a pipe")
      Signed-off-by: default avatarPaul Wise <pabs3@bonedaddy.net>
      Reported-by: Jakub Wilk <jwilk@jwilk.net> [https://bugs.debian.org/924398]
      Reported-by: Paul Wise <pabs3@bonedaddy.net> [https://lore.kernel.org/linux-fsdevel/c8b7ecb8508895bf4adb62a748e2ea2c71854597.camel@bonedaddy.net/
      
      ]
      Suggested-by: default avatarJakub Wilk <jwilk@jwilk.net>
      Acked-by: default avatarNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      315c6926
    • YueHaibing's avatar
      ocfs2: remove set but not used variable 'last_hash' · 7bc36e3c
      YueHaibing authored
      Fixes gcc '-Wunused-but-set-variable' warning:
      
        fs/ocfs2/xattr.c: In function ocfs2_xattr_bucket_find:
        fs/ocfs2/xattr.c:3828:6: warning: variable last_hash set but not used [-Wunused-but-set-variable]
      
      It's never used and can be removed.
      
      Link: http://lkml.kernel.org/r/20190716132110.34836-1-yuehaibing@huawei.com
      
      
      Signed-off-by: default avatarYueHaibing <yuehaibing@huawei.com>
      Acked-by: default avatarJoseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Gang He <ghe@suse.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7bc36e3c
  7. Aug 01, 2019
    • Damien Le Moal's avatar
      block: Fix __blkdev_direct_IO() for bio fragments · 0eb6ddfb
      Damien Le Moal authored
      
      The recent fix to properly handle IOCB_NOWAIT for async O_DIRECT IO
      (patch 6a43074e) introduced two problems with BIO fragment handling
      for direct IOs:
      1) The dio size processed is calculated by incrementing the ret variable
      by the size of the bio fragment issued for the dio. However, this size
      is obtained directly from bio->bi_iter.bi_size AFTER the bio submission
      which may result in referencing the bi_size value after the bio
      completed, resulting in an incorrect value use.
      2) The ret variable is not incremented by the size of the last bio
      fragment issued for the bio, leading to an invalid IO size being
      returned to the user.
      
      Fix both problem by using dio->size (which is incremented before the bio
      submission) to update the value of ret after bio submissions, including
      for the last bio fragment issued.
      
      Fixes: 6a43074e ("block: properly handle IOCB_NOWAIT for async O_DIRECT IO")
      Reported-by: default avatarMasato Suzuki <masato.suzuki@wdc.com>
      Signed-off-by: default avatarDamien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      0eb6ddfb
  8. Jul 31, 2019
    • Andreas Gruenbacher's avatar
      gfs2: Inode dirtying fix · 706cb549
      Andreas Gruenbacher authored
      
      With the recent iomap write page reclaim deadlock fix, it turns out that the
      GLF_DIRTY flag isn't always set when it needs to be anymore: previously, this
      happened as a side effect of always adding the inode buffer head to the current
      transaction with gfs2_trans_add_meta, but this isn't happening consistently
      anymore.  Fix by removing an additional unnecessary gfs2_trans_add_meta call
      and by setting the GLF_DIRTY flag in gfs2_iomap_end.
      
      (The GLF_DIRTY flag causes inode_go_sync to flush the transaction log when
      syncing out the glock of that inode.  When the flag isn't set, inode_go_sync
      will skip inodes, including ones with an i_state of I_DIRTY_PAGES, which will
      lead to cluster incoherency.)
      
      In addition, in gfs2_iomap_page_done, if the metadata has changed, mark the
      inode as I_DIRTY_DATASYNC to have the inode added to the current transaction:
      we don't expect metadata to change here, but let's err on the safe side.
      
      Fixes: d0a22a4b ("gfs2: Fix iomap write page reclaim deadlock");
      Signed-off-by: default avatarAndreas Gruenbacher <agruenba@redhat.com>
      706cb549
    • Al Viro's avatar
      Unbreak mount_capable() · c2c44ec2
      Al Viro authored
      
      In "consolidate the capability checks in sget_{fc,userns}())" the
      wrong argument had been passed to mount_capable() by sget_fc().
      That mistake had been further obscured later, when switching
      mount_capable() to fs_context has moved the calculation of
      bogus argument from sget_fc() to mount_capable() itself.  It
      should've been fc->user_ns all along.
      
      Screwed-up-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Reported-by: default avatarChristian Brauner <christian@brauner.io>
      Tested-by: default avatarChristian Brauner <christian@brauner.io>
      Reviewed-by: default avatarDavid Howells <dhowells@redhat.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      c2c44ec2
    • Jackie Liu's avatar
      io_uring: fix KASAN use after free in io_sq_wq_submit_work · d0ee8791
      Jackie Liu authored
      
      [root@localhost ~]# ./liburing/test/link
      
      QEMU Standard PC report that:
      
      [   29.379892] CPU: 0 PID: 84 Comm: kworker/u2:2 Not tainted 5.3.0-rc2-00051-g4010b622f1d2-dirty #86
      [   29.379902] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
      [   29.379913] Workqueue: io_ring-wq io_sq_wq_submit_work
      [   29.379929] Call Trace:
      [   29.379953]  dump_stack+0xa9/0x10e
      [   29.379970]  ? io_sq_wq_submit_work+0xbf4/0xe90
      [   29.379986]  print_address_description.cold.6+0x9/0x317
      [   29.379999]  ? io_sq_wq_submit_work+0xbf4/0xe90
      [   29.380010]  ? io_sq_wq_submit_work+0xbf4/0xe90
      [   29.380026]  __kasan_report.cold.7+0x1a/0x34
      [   29.380044]  ? io_sq_wq_submit_work+0xbf4/0xe90
      [   29.380061]  kasan_report+0xe/0x12
      [   29.380076]  io_sq_wq_submit_work+0xbf4/0xe90
      [   29.380104]  ? io_sq_thread+0xaf0/0xaf0
      [   29.380152]  process_one_work+0xb59/0x19e0
      [   29.380184]  ? pwq_dec_nr_in_flight+0x2c0/0x2c0
      [   29.380221]  worker_thread+0x8c/0xf40
      [   29.380248]  ? __kthread_parkme+0xab/0x110
      [   29.380265]  ? process_one_work+0x19e0/0x19e0
      [   29.380278]  kthread+0x30b/0x3d0
      [   29.380292]  ? kthread_create_on_node+0xe0/0xe0
      [   29.380311]  ret_from_fork+0x3a/0x50
      
      [   29.380635] Allocated by task 209:
      [   29.381255]  save_stack+0x19/0x80
      [   29.381268]  __kasan_kmalloc.constprop.6+0xc1/0xd0
      [   29.381279]  kmem_cache_alloc+0xc0/0x240
      [   29.381289]  io_submit_sqe+0x11bc/0x1c70
      [   29.381300]  io_ring_submit+0x174/0x3c0
      [   29.381311]  __x64_sys_io_uring_enter+0x601/0x780
      [   29.381322]  do_syscall_64+0x9f/0x4d0
      [   29.381336]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      [   29.381633] Freed by task 84:
      [   29.382186]  save_stack+0x19/0x80
      [   29.382198]  __kasan_slab_free+0x11d/0x160
      [   29.382210]  kmem_cache_free+0x8c/0x2f0
      [   29.382220]  io_put_req+0x22/0x30
      [   29.382230]  io_sq_wq_submit_work+0x28b/0xe90
      [   29.382241]  process_one_work+0xb59/0x19e0
      [   29.382251]  worker_thread+0x8c/0xf40
      [   29.382262]  kthread+0x30b/0x3d0
      [   29.382272]  ret_from_fork+0x3a/0x50
      
      [   29.382569] The buggy address belongs to the object at ffff888067172140
                      which belongs to the cache io_kiocb of size 224
      [   29.384692] The buggy address is located 120 bytes inside of
                      224-byte region [ffff888067172140, ffff888067172220)
      [   29.386723] The buggy address belongs to the page:
      [   29.387575] page:ffffea00019c5c80 refcount:1 mapcount:0 mapping:ffff88806ace5180 index:0x0
      [   29.387587] flags: 0x100000000000200(slab)
      [   29.387603] raw: 0100000000000200 dead000000000100 dead000000000122 ffff88806ace5180
      [   29.387617] raw: 0000000000000000 00000000800c000c 00000001ffffffff 0000000000000000
      [   29.387624] page dumped because: kasan: bad access detected
      
      [   29.387920] Memory state around the buggy address:
      [   29.388771]  ffff888067172080: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
      [   29.390062]  ffff888067172100: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
      [   29.391325] >ffff888067172180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [   29.392578]                                         ^
      [   29.393480]  ffff888067172200: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
      [   29.394744]  ffff888067172280: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [   29.396003] ==================================================================
      [   29.397260] Disabling lock debugging due to kernel taint
      
      io_sq_wq_submit_work free and read req again.
      
      Cc: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
      Cc: linux-block@vger.kernel.org
      Cc: stable@vger.kernel.org
      Fixes: f7b76ac9 ("io_uring: fix counter inc/dec mismatch in async_list")
      Signed-off-by: default avatarJackie Liu <liuyun01@kylinos.cn>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      d0ee8791
  9. Jul 30, 2019
    • Arnd Bergmann's avatar
      compat_ioctl: pppoe: fix PPPOEIOCSFWD handling · 055d8824
      Arnd Bergmann authored
      
      Support for handling the PPPOEIOCSFWD ioctl in compat mode was added in
      linux-2.5.69 along with hundreds of other commands, but was always broken
      sincen only the structure is compatible, but the command number is not,
      due to the size being sizeof(size_t), or at first sizeof(sizeof((struct
      sockaddr_pppox)), which is different on 64-bit architectures.
      
      Guillaume Nault adds:
      
        And the implementation was broken until 2016 (see 29e73269 ("pppoe:
        fix reference counting in PPPoE proxy")), and nobody ever noticed. I
        should probably have removed this ioctl entirely instead of fixing it.
        Clearly, it has never been used.
      
      Fix it by adding a compat_ioctl handler for all pppoe variants that
      translates the command number and then calls the regular ioctl function.
      
      All other ioctl commands handled by pppoe are compatible between 32-bit
      and 64-bit, and require compat_ptr() conversion.
      
      This should apply to all stable kernels.
      
      Acked-by: default avatarGuillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      055d8824
    • Jan Kara's avatar
      loop: Fix mount(2) failure due to race with LOOP_SET_FD · 89e524c0
      Jan Kara authored
      
      Commit 33ec3e53 ("loop: Don't change loop device under exclusive
      opener") made LOOP_SET_FD ioctl acquire exclusive block device reference
      while it updates loop device binding. However this can make perfectly
      valid mount(2) fail with EBUSY due to racing LOOP_SET_FD holding
      temporarily the exclusive bdev reference in cases like this:
      
      for i in {a..z}{a..z}; do
              dd if=/dev/zero of=$i.image bs=1k count=0 seek=1024
              mkfs.ext2 $i.image
              mkdir mnt$i
      done
      
      echo "Run"
      for i in {a..z}{a..z}; do
              mount -o loop -t ext2 $i.image mnt$i &
      done
      
      Fix the problem by not getting full exclusive bdev reference in
      LOOP_SET_FD but instead just mark the bdev as being claimed while we
      update the binding information. This just blocks new exclusive openers
      instead of failing them with EBUSY thus fixing the problem.
      
      Fixes: 33ec3e53 ("loop: Don't change loop device under exclusive opener")
      Cc: stable@vger.kernel.org
      Tested-by: default avatarKai-Heng Feng <kai.heng.feng@canonical.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      89e524c0
    • Jia-Ju Bai's avatar
      xfs: Fix possible null-pointer dereferences in xchk_da_btree_block_check_sibling() · afa1d96d
      Jia-Ju Bai authored
      
      In xchk_da_btree_block_check_sibling(), there is an if statement on
      line 274 to check whether ds->state->altpath.blk[level].bp is NULL:
          if (ds->state->altpath.blk[level].bp)
      
      When ds->state->altpath.blk[level].bp is NULL, it is used on line 281:
          xfs_trans_brelse(..., ds->state->altpath.blk[level].bp);
              struct xfs_buf_log_item *bip = bp->b_log_item;
              ASSERT(bp->b_transp == tp);
      
      Thus, possible null-pointer dereferences may occur.
      
      To fix these bugs, ds->state->altpath.blk[level].bp is checked before
      being used.
      
      These bugs are found by a static analysis tool STCheck written by us.
      
      Signed-off-by: default avatarJia-Ju Bai <baijiaju1990@gmail.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      afa1d96d
    • Filipe Manana's avatar
      Btrfs: fix deadlock between fiemap and transaction commits · a6d155d2
      Filipe Manana authored
      
      The fiemap handler locks a file range that can have unflushed delalloc,
      and after locking the range, it tries to attach to a running transaction.
      If the running transaction started its commit, that is, it is in state
      TRANS_STATE_COMMIT_START, and either the filesystem was mounted with the
      flushoncommit option or the transaction is creating a snapshot for the
      subvolume that contains the file that fiemap is operating on, we end up
      deadlocking. This happens because fiemap is blocked on the transaction,
      waiting for it to complete, and the transaction is waiting for the flushed
      dealloc to complete, which requires locking the file range that the fiemap
      task already locked. The following stack traces serve as an example of
      when this deadlock happens:
      
        (...)
        [404571.515510] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
        [404571.515956] Call Trace:
        [404571.516360]  ? __schedule+0x3ae/0x7b0
        [404571.516730]  schedule+0x3a/0xb0
        [404571.517104]  lock_extent_bits+0x1ec/0x2a0 [btrfs]
        [404571.517465]  ? remove_wait_queue+0x60/0x60
        [404571.517832]  btrfs_finish_ordered_io+0x292/0x800 [btrfs]
        [404571.518202]  normal_work_helper+0xea/0x530 [btrfs]
        [404571.518566]  process_one_work+0x21e/0x5c0
        [404571.518990]  worker_thread+0x4f/0x3b0
        [404571.519413]  ? process_one_work+0x5c0/0x5c0
        [404571.519829]  kthread+0x103/0x140
        [404571.520191]  ? kthread_create_worker_on_cpu+0x70/0x70
        [404571.520565]  ret_from_fork+0x3a/0x50
        [404571.520915] kworker/u8:6    D    0 31651      2 0x80004000
        [404571.521290] Workqueue: btrfs-flush_delalloc btrfs_flush_delalloc_helper [btrfs]
        (...)
        [404571.537000] fsstress        D    0 13117  13115 0x00004000
        [404571.537263] Call Trace:
        [404571.537524]  ? __schedule+0x3ae/0x7b0
        [404571.537788]  schedule+0x3a/0xb0
        [404571.538066]  wait_current_trans+0xc8/0x100 [btrfs]
        [404571.538349]  ? remove_wait_queue+0x60/0x60
        [404571.538680]  start_transaction+0x33c/0x500 [btrfs]
        [404571.539076]  btrfs_check_shared+0xa3/0x1f0 [btrfs]
        [404571.539513]  ? extent_fiemap+0x2ce/0x650 [btrfs]
        [404571.539866]  extent_fiemap+0x2ce/0x650 [btrfs]
        [404571.540170]  do_vfs_ioctl+0x526/0x6f0
        [404571.540436]  ksys_ioctl+0x70/0x80
        [404571.540734]  __x64_sys_ioctl+0x16/0x20
        [404571.540997]  do_syscall_64+0x60/0x1d0
        [404571.541279]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        (...)
        [404571.543729] btrfs           D    0 14210  14208 0x00004000
        [404571.544023] Call Trace:
        [404571.544275]  ? __schedule+0x3ae/0x7b0
        [404571.544526]  ? wait_for_completion+0x112/0x1a0
        [404571.544795]  schedule+0x3a/0xb0
        [404571.545064]  schedule_timeout+0x1ff/0x390
        [404571.545351]  ? lock_acquire+0xa6/0x190
        [404571.545638]  ? wait_for_completion+0x49/0x1a0
        [404571.545890]  ? wait_for_completion+0x112/0x1a0
        [404571.546228]  wait_for_completion+0x131/0x1a0
        [404571.546503]  ? wake_up_q+0x70/0x70
        [404571.546775]  btrfs_wait_ordered_extents+0x27c/0x400 [btrfs]
        [404571.547159]  btrfs_commit_transaction+0x3b0/0xae0 [btrfs]
        [404571.547449]  ? btrfs_mksubvol+0x4a4/0x640 [btrfs]
        [404571.547703]  ? remove_wait_queue+0x60/0x60
        [404571.547969]  btrfs_mksubvol+0x605/0x640 [btrfs]
        [404571.548226]  ? __sb_start_write+0xd4/0x1c0
        [404571.548512]  ? mnt_want_write_file+0x24/0x50
        [404571.548789]  btrfs_ioctl_snap_create_transid+0x169/0x1a0 [btrfs]
        [404571.549048]  btrfs_ioctl_snap_create_v2+0x11d/0x170 [btrfs]
        [404571.549307]  btrfs_ioctl+0x133f/0x3150 [btrfs]
        [404571.549549]  ? mem_cgroup_charge_statistics+0x4c/0xd0
        [404571.549792]  ? mem_cgroup_commit_charge+0x84/0x4b0
        [404571.550064]  ? __handle_mm_fault+0xe3e/0x11f0
        [404571.550306]  ? do_raw_spin_unlock+0x49/0xc0
        [404571.550608]  ? _raw_spin_unlock+0x24/0x30
        [404571.550976]  ? __handle_mm_fault+0xedf/0x11f0
        [404571.551319]  ? do_vfs_ioctl+0xa2/0x6f0
        [404571.551659]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
        [404571.552087]  do_vfs_ioctl+0xa2/0x6f0
        [404571.552355]  ksys_ioctl+0x70/0x80
        [404571.552621]  __x64_sys_ioctl+0x16/0x20
        [404571.552864]  do_syscall_64+0x60/0x1d0
        [404571.553104]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        (...)
      
      If we were joining the transaction instead of attaching to it, we would
      not risk a deadlock because a join only blocks if the transaction is in a
      state greater then or equals to TRANS_STATE_COMMIT_DOING, and the delalloc
      flush performed by a transaction is done before it reaches that state,
      when it is in the state TRANS_STATE_COMMIT_START. However a transaction
      join is intended for use cases where we do modify the filesystem, and
      fiemap only needs to peek at delayed references from the current
      transaction in order to determine if extents are shared, and, besides
      that, when there is no current transaction or when it blocks to wait for
      a current committing transaction to complete, it creates a new transaction
      without reserving any space. Such unnecessary transactions, besides doing
      unnecessary IO, can cause transaction aborts (-ENOSPC) and unnecessary
      rotation of the precious backup roots.
      
      So fix this by adding a new transaction join variant, named join_nostart,
      which behaves like the regular join, but it does not create a transaction
      when none currently exists or after waiting for a committing transaction
      to complete.
      
      Fixes: 03628cdb ("Btrfs: do not start a transaction during fiemap")
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a6d155d2
    • Filipe Manana's avatar
      Btrfs: fix race leading to fs corruption after transaction abort · cb2d3dad
      Filipe Manana authored
      
      When one transaction is finishing its commit, it is possible for another
      transaction to start and enter its initial commit phase as well. If the
      first ends up getting aborted, we have a small time window where the second
      transaction commit does not notice that the previous transaction aborted
      and ends up committing, writing a superblock that points to btrees that
      reference extent buffers (nodes and leafs) that were not persisted to disk.
      The consequence is that after mounting the filesystem again, we will be
      unable to load some btree nodes/leafs, either because the content on disk
      is either garbage (or just zeroes) or corresponds to the old content of a
      previouly COWed or deleted node/leaf, resulting in the well known error
      messages "parent transid verify failed on ...".
      The following sequence diagram illustrates how this can happen.
      
              CPU 1                                           CPU 2
      
       <at transaction N>
      
       btrfs_commit_transaction()
         (...)
         --> sets transaction state to
             TRANS_STATE_UNBLOCKED
         --> sets fs_info->running_transaction
             to NULL
      
                                                          (...)
                                                          btrfs_start_transaction()
                                                            start_transaction()
                                                              wait_current_trans()
                                                                --> returns immediately
                                                                    because
                                                                    fs_info->running_transaction
                                                                    is NULL
                                                              join_transaction()
                                                                --> creates transaction N + 1
                                                                --> sets
                                                                    fs_info->running_transaction
                                                                    to transaction N + 1
                                                                --> adds transaction N + 1 to
                                                                    the fs_info->trans_list list
                                                              --> returns transaction handle
                                                                  pointing to the new
                                                                  transaction N + 1
                                                          (...)
      
                                                          btrfs_sync_file()
                                                            btrfs_start_transaction()
                                                              --> returns handle to
                                                                  transaction N + 1
                                                            (...)
      
         btrfs_write_and_wait_transaction()
           --> writeback of some extent
               buffer fails, returns an
      	 error
         btrfs_handle_fs_error()
           --> sets BTRFS_FS_STATE_ERROR in
               fs_info->fs_state
         --> jumps to label "scrub_continue"
         cleanup_transaction()
           btrfs_abort_transaction(N)
             --> sets BTRFS_FS_STATE_TRANS_ABORTED
                 flag in fs_info->fs_state
             --> sets aborted field in the
                 transaction and transaction
      	   handle structures, for
                 transaction N only
           --> removes transaction from the
               list fs_info->trans_list
                                                            btrfs_commit_transaction(N + 1)
                                                              --> transaction N + 1 was not
      							    aborted, so it proceeds
                                                              (...)
                                                              --> sets the transaction's state
                                                                  to TRANS_STATE_COMMIT_START
                                                              --> does not find the previous
                                                                  transaction (N) in the
                                                                  fs_info->trans_list, so it
                                                                  doesn't know that transaction
                                                                  was aborted, and the commit
                                                                  of transaction N + 1 proceeds
                                                              (...)
                                                              --> sets transaction N + 1 state
                                                                  to TRANS_STATE_UNBLOCKED
                                                              btrfs_write_and_wait_transaction()
                                                                --> succeeds writing all extent
                                                                    buffers created in the
                                                                    transaction N + 1
                                                              write_all_supers()
                                                                 --> succeeds
                                                                 --> we now have a superblock on
                                                                     disk that points to trees
                                                                     that refer to at least one
                                                                     extent buffer that was
                                                                     never persisted
      
      So fix this by updating the transaction commit path to check if the flag
      BTRFS_FS_STATE_TRANS_ABORTED is set on fs_info->fs_state if after setting
      the transaction to the TRANS_STATE_COMMIT_START we do not find any previous
      transaction in the fs_info->trans_list. If the flag is set, just fail the
      transaction commit with -EROFS, as we do in other places. The exact error
      code for the previous transaction abort was already logged and reported.
      
      Fixes: 49b25e05 ("btrfs: enhance transaction abort infrastructure")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cb2d3dad
    • Filipe Manana's avatar
      Btrfs: fix incremental send failure after deduplication · b4f9a1a8
      Filipe Manana authored
      When doing an incremental send operation we can fail if we previously did
      deduplication operations against a file that exists in both snapshots. In
      that case we will fail the send operation with -EIO and print a message
      to dmesg/syslog like the following:
      
        BTRFS error (device sdc): Send: inconsistent snapshot, found updated \
        extent for inode 257 without updated inode item, send root is 258, \
        parent root is 257
      
      This requires that we deduplicate to the same file in both snapshots for
      the same amount of times on each snapshot. The issue happens because a
      deduplication only updates the iversion of an inode and does not update
      any other field of the inode, therefore if we deduplicate the file on
      each snapshot for the same amount of time, the inode will have the same
      iversion value (stored as the "sequence" field on the inode item) on both
      snapshots, therefore it will be seen as unchanged between in the send
      snapshot while there are new/updated/deleted extent items when comparing
      to the parent snapshot. This makes the send operation return -EIO and
      print an error message.
      
      Example reproducer:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        # Create our first file. The first half of the file has several 64Kb
        # extents while the second half as a single 512Kb extent.
        $ xfs_io -f -s -c "pwrite -S 0xb8 -b 64K 0 512K" /mnt/foo
        $ xfs_io -c "pwrite -S 0xb8 512K 512K" /mnt/foo
      
        # Create the base snapshot and the parent send stream from it.
        $ btrfs subvolume snapshot -r /mnt /mnt/mysnap1
        $ btrfs send -f /tmp/1.snap /mnt/mysnap1
      
        # Create our second file, that has exactly the same data as the first
        # file.
        $ xfs_io -f -c "pwrite -S 0xb8 0 1M" /mnt/bar
      
        # Create the second snapshot, used for the incremental send, before
        # doing the file deduplication.
        $ btrfs subvolume snapshot -r /mnt /mnt/mysnap2
      
        # Now before creating the incremental send stream:
        #
        # 1) Deduplicate into a subrange of file foo in snapshot mysnap1. This
        #    will drop several extent items and add a new one, also updating
        #    the inode's iversion (sequence field in inode item) by 1, but not
        #    any other field of the inode;
        #
        # 2) Deduplicate into a different subrange of file foo in snapshot
        #    mysnap2. This will replace an extent item with a new one, also
        #    updating the inode's iversion by 1 but not any other field of the
        #    inode.
        #
        # After these two deduplication operations, the inode items, for file
        # foo, are identical in both snapshots, but we have different extent
        # items for this inode in both snapshots. We want to check this doesn't
        # cause send to fail with an error or produce an incorrect stream.
      
        $ xfs_io -r -c "dedupe /mnt/bar 0 0 512K" /mnt/mysnap1/foo
        $ xfs_io -r -c "dedupe /mnt/bar 512K 512K 512K" /mnt/mysnap2/foo
      
        # Create the incremental send stream.
        $ btrfs send -p /mnt/mysnap1 -f /tmp/2.snap /mnt/mysnap2
        ERROR: send ioctl failed with -5: Input/output error
      
      This issue started happening back in 2015 when deduplication was updated
      to not update the inode's ctime and mtime and update only the iversion.
      Back then we would hit a BUG_ON() in send, but later in 2016 send was
      updated to return -EIO and print the error message instead of doing the
      BUG_ON().
      
      A test case for fstests follows soon.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203933
      
      
      Fixes: 1c919a5e ("btrfs: don't update mtime/ctime on deduped inodes")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b4f9a1a8
  10. Jul 29, 2019
  11. Jul 27, 2019
  12. Jul 26, 2019
Loading