      A customer reported rcu stalls and softlockup warnings on a computer
      with many CPU cores and many many more IO threads trying to write to a
      filesystem that is totally out of space.  Subsequent analysis pointed to
      the many many IO threads calling xfs_flush_inodes -> sync_inodes_sb,
      which causes a lot of wb_writeback_work to be queued.  The writeback
      worker spends so much time trying to wake the many many threads waiting
      for writeback completion that it trips the softlockup detector, and (in
      this case) the system automatically reboots.

      In addition, they complain that the lengthy xfs_flush_inodes scan traps
      all of those threads in uninterruptible sleep, which hampers their
      ability to kill the program or do anything else to escape the situation.

      If there are thousands of threads trying to write to files on a full
      filesystem, each of those threads will start separate copies of the
      inode flush scan.  This is kind of pointless since we only need one
      scan, so rate limit the inode flush.
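
      The "only one scan is needed" idea can be sketched in userspace C as
      follows. This is a minimal illustration, not the code from the patch:
      struct flush_state and flush_inodes_ratelimited() are hypothetical
      names, and the sketch assumes that a caller who finds a scan already
      running may simply return instead of waiting for it to finish.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-filesystem state; the names are illustrative only. */
struct flush_state {
	atomic_flag scan_running;     /* set while one thread runs the scan */
	unsigned long scans_started;  /* only modified while the flag is held */
};

/*
 * Run the expensive scan only if no other thread is already running it.
 * Returns true if this caller ran the scan, false if it was skipped.
 */
bool flush_inodes_ratelimited(struct flush_state *fs,
			      void (*scan)(void *), void *arg)
{
	/* test_and_set returns the previous value: true means "already busy" */
	if (atomic_flag_test_and_set(&fs->scan_running))
		return false;

	fs->scans_started++;
	if (scan)
		scan(arg);

	atomic_flag_clear(&fs->scan_running);
	return true;
}
```

      With thousands of writers hitting ENOSPC at once, one of them runs
      the scan and the rest return immediately, so the amount of queued
      writeback work no longer grows with the number of threads.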
  27 Mar, 2020
  07 Feb, 2020
  14 Jan, 2020
      I observed a hang in generic/308 while running fstests on an i686 kernel.
      The hang occurred when trying to purge the pagecache on a large sparse
      file that had a page created past MAX_LFS_FILESIZE, which caused an
      integer overflow in the pagecache xarray and resulted in an infinite
      loop.

      I then noticed that Linus changed the definition of MAX_LFS_FILESIZE in
      commit 0cc3b0ec ("Clarify (and fix) MAX_LFS_FILESIZE macros") so
      that it is now one page short of the maximum page index on 32-bit
      kernels.  Because the XFS function to compute max offset open-codes the
      2005-era MAX_LFS_FILESIZE computation and neither the vfs nor mm perform
      any sanity checking of s_maxbytes, the code in generic/308 can create a
      page above the pagecache's limit and kaboom.

      Fix all this by setting s_maxbytes to MAX_LFS_FILESIZE directly and
      aborting the mount with a warning if our assumptions ever break.  I have
      no answer for why this seems to have been broken for years and nobody
      noticed.
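
      The arithmetic behind "one page short of the maximum page index"
      can be sketched as below. This is an illustration under stated
      assumptions (4 KiB pages and a 32-bit page index, as on i686); the
      SKETCH_* constants and sketch_* helpers are hypothetical names, not
      kernel definitions.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12u                     /* assumption: 4 KiB pages */
#define SKETCH_MAX_INDEX  ((uint64_t)UINT32_MAX)  /* assumption: 32-bit page index */

/* Largest file size whose last byte sits one page short of the max index. */
static inline uint64_t sketch_max_filesize(void)
{
	return SKETCH_MAX_INDEX << SKETCH_PAGE_SHIFT;
}

/* Page index that holds byte offset pos. */
static inline uint64_t sketch_page_index(uint64_t pos)
{
	return pos >> SKETCH_PAGE_SHIFT;
}

/* Mount-time style sanity check: refuse a limit the pagecache cannot index. */
bool sketch_maxbytes_ok(uint64_t maxbytes)
{
	return maxbytes <= sketch_max_filesize();
}
```

      Under these assumptions the limit is 0xFFFFFFFF000 bytes and its last
      byte lands on page index 0xFFFFFFFE, while a limit of 2^44 - 1 bytes
      would let the last byte reach index 0xFFFFFFFF and fails the check.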
  18 Nov, 2019
  11 Nov, 2019
  06 Nov, 2019
  05 Nov, 2019
  29 Oct, 2019
  21 Oct, 2019
  06 Sep, 2019
      generic/530 on a machine with enough RAM and a non-preemptible
      kernel can run the AGI processing phase of log recovery entirely out
      of cache. This means it never blocks on locks, never waits for IO
      and runs entirely through the unlinked lists until it either
      completes or blocks and hangs because it has run out of log space.

      It runs out of log space because the background CIL push is
      scheduled but never runs. queue_work() queues the CIL work on the
      current CPU that is busy, and the workqueue code will not run it on
      any other CPU. Hence if the unlinked list processing never yields
      the CPU voluntarily, the push work is delayed indefinitely. This
      results in the CIL aggregating changes until all the log space is
      consumed.

      When the log recovery processing eventually blocks, the CIL flushes,
      but because the last iclog isn't full it isn't submitted for IO, so
      the CIL flush never completes and nothing ever moves the log
      head forwards, or indeed inserts anything into the tail of the log,
      and hence nothing is able to get the log moving again and recovery
      hangs.

      There are several problems here, but the two obvious ones from
      the trace are that:
      	a) log recovery does not yield the CPU for over 4 seconds,
      	b) binding CIL pushes to a single CPU is a really bad idea.

      This patch addresses just these two aspects of the problem, and is
      suitable for backporting to work around any issues in older kernels.

      The more fundamental problem of preventing the CIL from consuming
      more than 50% of the log without committing will take more invasive
      and complex work, so will be done as followup work.
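
      Problem (a) above can be illustrated with a userspace analog: a long
      processing loop that gives up the CPU after every item. This is only
      a sketch; process_one() and process_all_yielding() are hypothetical
      names, and sched_yield() stands in for what kernel code would do
      with cond_resched(). Problem (b) has no direct userspace equivalent;
      in the kernel the usual remedy is an unbound workqueue (WQ_UNBOUND),
      though the text above does not say which calls the patch uses.

```c
#include <sched.h>
#include <stddef.h>

/* Hypothetical stand-in for processing one entry of an unlinked list. */
static void process_one(unsigned long *done)
{
	(*done)++;
}

/*
 * Process nr_items entries, yielding the CPU after each one so that
 * other runnable work on this CPU is not starved for seconds at a time.
 * Returns the number of times the loop yielded.
 */
unsigned long process_all_yielding(size_t nr_items, unsigned long *done)
{
	unsigned long yields = 0;

	for (size_t i = 0; i < nr_items; i++) {
		process_one(done);
		sched_yield();  /* userspace analog of cond_resched() */
		yields++;
	}
	return yields;
}
```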
  30 Aug, 2019
      Fill in the appropriate limits to avoid inconsistencies
      in the vfs cached inode times when timestamps are
      outside the permitted range.

      Even though some filesystems are read-only, fill in the
      timestamps to reflect the on-disk representation.
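
      What per-filesystem limits buy can be sketched as a simple clamp:
      once the supported range is known, an out-of-range time is pinned
      to the nearest limit, so the cached value always matches something
      the on-disk format can represent. struct ts_limits and
      clamp_timestamp() are hypothetical names used for illustration; the
      actual field names and helper are not given in the text above.

```c
#include <stdint.h>

/* Hypothetical per-filesystem timestamp range, in seconds since the epoch. */
struct ts_limits {
	int64_t min_sec;
	int64_t max_sec;
};

/* Pin sec to the range the filesystem can represent on disk. */
int64_t clamp_timestamp(int64_t sec, const struct ts_limits *lim)
{
	if (sec < lim->min_sec)
		return lim->min_sec;
	if (sec > lim->max_sec)
		return lim->max_sec;
	return sec;
}
```

      For a format that stores seconds in an unsigned 32-bit field (an
      assumed example), the limits would be 0 and 4294967295, and a
      timestamp of -5 would be stored and cached as 0.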
