1. 12 Jul, 2019 3 commits
  2. 06 Jul, 2019 1 commit
    • Linus Torvalds's avatar
      Revert "mm: page cache: store only head pages in i_pages" · 69bf4b6b
      Linus Torvalds authored
      This reverts commit 5fd4ca2d.
      
      Mikhail Gavrilov reports that it causes the VM_BUG_ON_PAGE() in
      __delete_from_swap_cache() to trigger:
      
         page:ffffd6d34dff0000 refcount:1 mapcount:1 mapping:ffff97812323a689 index:0xfecec363
         anon
         flags: 0x17fffe00080034(uptodate|lru|active|swapbacked)
         raw: 0017fffe00080034 ffffd6d34c67c508 ffffd6d3504b8d48 ffff97812323a689
         raw: 00000000fecec363 0000000000000000 0000000100000000 ffff978433ace000
         page dumped because: VM_BUG_ON_PAGE(entry != page)
         page->mem_cgroup:ffff978433ace000
         ------------[ cut here ]------------
         kernel BUG at mm/swap_state.c:170!
         invalid opcode: 0000 [#1] SMP NOPTI
         CPU: 1 PID: 221 Comm: kswapd0 Not tainted 5.2.0-0.rc2.git0.1.fc31.x86_64 #1
         Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 2202 04/11/2019
         RIP: 0010:__delete_from_swap_cache+0x20d/0x240
         Code: 30 65 48 33 04 25 28 00 00 00 75 4a 48 83 c4 38 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 c7 c6 2f dc 0f 8a 48 89 c7 e8 93 1b fd ff <0f> 0b 48 c7 c6 a8 74 0f 8a e8 85 1b fd ff 0f 0b 48 c7 c6 a8 7d 0f
         RSP: 0018:ffffa982036e7980 EFLAGS: 00010046
         RAX: 0000000000000021 RBX: 0000000000000040 RCX: 0000000000000006
         RDX: 0000000000000000 RSI: 0000000000000086 RDI: ffff97843d657900
         RBP: 0000000000000001 R08: ffffa982036e7835 R09: 0000000000000535
         R10: ffff97845e21a46c R11: ffffa982036e7835 R12: ffff978426387120
         R13: 0000000000000000 R14: ffffd6d34dff0040 R15: ffffd6d34dff0000
         FS:  0000000000000000(0000) GS:ffff97843d640000(0000) knlGS:0000000000000000
         CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         CR2: 00002cba88ef5000 CR3: 000000078a97c000 CR4: 00000000003406e0
         Call Trace:
          delete_from_swap_cache+0x46/0xa0
          try_to_free_swap+0xbc/0x110
          swap_writepage+0x13/0x70
          pageout.isra.0+0x13c/0x350
          shrink_page_list+0xc14/0xdf0
          shrink_inactive_list+0x1e5/0x3c0
          shrink_node_memcg+0x202/0x760
          shrink_node+0xe0/0x470
          balance_pgdat+0x2d1/0x510
          kswapd+0x220/0x420
          kthread+0xfb/0x130
          ret_from_fork+0x22/0x40
      
      and it's not immediately obvious why it happens.  It's too late in the
      rc cycle to do anything but revert for now.
      
      Link: https://lore.kernel.org/lkml/CABXGCsN9mYmBD-4GaaeW_NrDu+FDXLzr_6x+XNxfmFV6QkYCDg@mail.gmail.com/Reported-and-bisected-by: default avatarMikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Suggested-by: default avatarJan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Kirill Shutemov <kirill@shutemov.name>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      69bf4b6b
  3. 20 Jun, 2019 1 commit
  4. 09 Jun, 2019 3 commits
  5. 21 May, 2019 1 commit
  6. 14 May, 2019 4 commits
  7. 15 Mar, 2019 3 commits
    • Linus Torvalds's avatar
      filemap: add a comment about FAULT_FLAG_RETRY_NOWAIT behavior · 8b0f9fa2
      Linus Torvalds authored
      I thought Josef Bacik's patch to drop the mmap_sem was buggy, because
      when looking at the error cases, there was one case where we returned
      VM_FAULT_RETRY without actually dropping the mmap_sem.
      
      Josef had to explain to me (using small words) that yes, that's actually
      what we're supposed to do, and his patch was correct.  Which not only
      convinced me he knew what he was doing and I should stop arguing with
      him, but also that I should add a comment to the case I was confused
      about.
      Patiently-pointed-out-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8b0f9fa2
    • Josef Bacik's avatar
      filemap: drop the mmap_sem for all blocking operations · 6b4c9f44
      Josef Bacik authored
      Currently we only drop the mmap_sem if there is contention on the page
      lock.  The idea is that we issue readahead and then go to lock the page
      while it is under IO and we want to not hold the mmap_sem during the IO.
      
      The problem with this is the assumption that the readahead does anything.
      In the case that the box is under extreme memory or IO pressure we may end
      up not reading anything at all for readahead, which means we will end up
      reading in the page under the mmap_sem.
      
      Even if the readahead does something, it could get throttled because of io
      pressure on the system and the process is in a lower priority cgroup.
      
      Holding the mmap_sem while doing IO is problematic because it can cause
      system-wide priority inversions.  Consider some large company that does a
      lot of web traffic.  This large company has load balancing logic in it's
      core web server, cause some engineer thought this was a brilliant plan.
      This load balancing logic gets statistics from /proc about the system,
      which trip over processes mmap_sem for various reasons.  Now the web
      server application is in a protected cgroup, but these other processes may
      not be, and if they are being throttled while their mmap_sem is held we'll
      stall, and cause this nice death spiral.
      
      Instead rework filemap fault path to drop the mmap sem at any point that
      we may do IO or block for an extended period of time.  This includes while
      issuing readahead, locking the page, or needing to call ->readpage because
      readahead did not occur.  Then once we have a fully uptodate page we can
      return with VM_FAULT_RETRY and come back again to find our nicely in-cache
      page that was gotten outside of the mmap_sem.
      
      This patch also adds a new helper for locking the page with the mmap_sem
      dropped.  This doesn't make sense currently as generally speaking if the
      page is already locked it'll have been read in (unless there was an error)
      before it was unlocked.  However a forthcoming patchset will change this
      with the ability to abort read-ahead bio's if necessary, making it more
      likely that we could contend for a page lock and still have a not uptodate
      page.  This allows us to deal with this case by grabbing the lock and
      issuing the IO without the mmap_sem held, and then returning
      VM_FAULT_RETRY to come back around.
      
      [josef@toxicpanda.com: v6]
        Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
      [kirill@shutemov.name: fix race in filemap_fault()]
        Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
      [akpm@linux-foundation.org: coding style fixes]
      Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.comSigned-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6b4c9f44
    • Josef Bacik's avatar
      filemap: kill page_cache_read usage in filemap_fault · a75d4c33
      Josef Bacik authored
      Patch series "drop the mmap_sem when doing IO in the fault path", v6.
      
      Now that we have proper isolation in place with cgroups2 we have started
      going through and fixing the various priority inversions.  Most are all
      gone now, but this one is sort of weird since it's not necessarily a
      priority inversion that happens within the kernel, but rather because of
      something userspace does.
      
      We have giant applications that we want to protect, and parts of these
      giant applications do things like watch the system state to determine how
      healthy the box is for load balancing and such.  This involves running
      'ps' or other such utilities.  These utilities will often walk
      /proc/<pid>/whatever, and these files can sometimes need to
      down_read(&task->mmap_sem).  Not usually a big deal, but we noticed when
      we are stress testing that sometimes our protected application has latency
      spikes trying to get the mmap_sem for tasks that are in lower priority
      cgroups.
      
      This is because any down_write() on a semaphore essentially turns it into
      a mutex, so even if we currently have it held for reading, any new readers
      will not be allowed on to keep from starving the writer.  This is fine,
      except a lower priority task could be stuck doing IO because it has been
      throttled to the point that its IO is taking much longer than normal.  But
      because a higher priority group depends on this completing it is now stuck
      behind lower priority work.
      
      In order to avoid this particular priority inversion we want to use the
      existing retry mechanism to stop from holding the mmap_sem at all if we
      are going to do IO.  This already exists in the read case sort of, but
      needed to be extended for more than just grabbing the page lock.  With
      io.latency we throttle at submit_bio() time, so the readahead stuff can
      block and even page_cache_read can block, so all these paths need to have
      the mmap_sem dropped.
      
      The other big thing is ->page_mkwrite.  btrfs is particularly shitty here
      because we have to reserve space for the dirty page, which can be a very
      expensive operation.  We use the same retry method as the read path, and
      simply cache the page and verify the page is still setup properly the next
      pass through ->page_mkwrite().
      
      I've tested these patches with xfstests and there are no regressions.
      
      This patch (of 3):
      
      If we do not have a page at filemap_fault time we'll do this weird forced
      page_cache_read thing to populate the page, and then drop it again and
      loop around and find it.  This makes for 2 ways we can read a page in
      filemap_fault, and it's not really needed.  Instead add a FGP_FOR_MMAP
      flag so that pagecache_get_page() will return a unlocked page that's in
      pagecache.  Then use the normal page locking and readpage logic already in
      filemap_fault.  This simplifies the no page in page cache case
      significantly.
      
      [akpm@linux-foundation.org: fix comment text]
      [josef@toxicpanda.com: don't unlock null page in FGP_FOR_MMAP case]
        Link: http://lkml.kernel.org/r/20190312201742.22935-1-josef@toxicpanda.com
      Link: http://lkml.kernel.org/r/20181211173801.29535-2-josef@toxicpanda.comSigned-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a75d4c33
  8. 14 Mar, 2019 1 commit
  9. 06 Mar, 2019 5 commits
  10. 04 Jan, 2019 1 commit
  11. 28 Dec, 2018 3 commits
  12. 29 Oct, 2018 5 commits
  13. 26 Oct, 2018 6 commits
  14. 23 Oct, 2018 1 commit
  15. 21 Oct, 2018 2 commits