  • Josef Bacik's avatar
    filemap: kill page_cache_read usage in filemap_fault · a75d4c33
    Josef Bacik authored
    Patch series "drop the mmap_sem when doing IO in the fault path", v6.
    
    Now that we have proper isolation in place with cgroups2 we have started
    going through and fixing the various priority inversions.  Most are gone
    now, but this one is sort of weird since it's not necessarily a
    priority inversion that happens within the kernel, but rather because of
    something userspace does.
    
    We have giant applications that we want to protect, and parts of these
    giant applications do things like watch the system state to determine how
    healthy the box is for load balancing and such.  This involves running
    'ps' or other such utilities.  These utilities will often walk
    /proc/<pid>/whatever, and reading these files can sometimes require
    down_read(&task->mmap_sem).  Not usually a big deal, but we noticed during
    stress testing that our protected application sometimes has latency
    spikes trying to get the mmap_sem for tasks that are in lower priority
    cgroups.
    
    This is because any down_write() on a semaphore essentially turns it into
    a mutex, so even if we currently have it held for reading, any new readers
    will be blocked to keep from starving the writer.  This is fine,
    except a lower priority task could be stuck doing IO because it has been
    throttled to the point that its IO is taking much longer than normal.  But
    because a higher priority group depends on this completing it is now stuck
    behind lower priority work.
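
    The blocking behaviour described above can be sketched with a toy
    userspace model.  This is not the kernel rwsem implementation; the
    toy_* names are made up for illustration and only capture the rule
    "a queued writer blocks new readers".

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a writer-fair reader/writer semaphore; NOT the kernel rwsem. */
struct toy_rwsem {
    int  readers;         /* readers currently holding the lock */
    bool writer_waiting;  /* a writer is queued behind those readers */
    bool writer_active;   /* a writer currently holds the lock */
};

/* A new reader gets in only if no writer holds or is waiting for the lock. */
static bool toy_try_read(struct toy_rwsem *s)
{
    if (s->writer_active || s->writer_waiting)
        return false;
    s->readers++;
    return true;
}

/* A writer that cannot get in queues itself, which shuts out new readers. */
static bool toy_try_write(struct toy_rwsem *s)
{
    if (s->readers > 0 || s->writer_active) {
        s->writer_waiting = true;
        return false;
    }
    s->writer_waiting = false;
    s->writer_active = true;
    return true;
}

static void toy_read_unlock(struct toy_rwsem *s)
{
    s->readers--;
}
```

    In this model, a throttled task that holds the lock for reading while
    it waits on IO, followed by any writer, is enough to make every later
    toy_try_read() fail until the slow reader calls toy_read_unlock().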
    
    To avoid this particular priority inversion we want to use the
    existing retry mechanism so that we do not hold the mmap_sem at all if
    we are going to do IO.  This already exists, sort of, in the read case,
    but it needed to be extended for more than just grabbing the page lock.  With
    io.latency we throttle at submit_bio() time, so the readahead stuff can
    block and even page_cache_read can block, so all these paths need to have
    the mmap_sem dropped.
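
    The retry idea can be sketched in userspace as follows.  This is a
    simplified model under stated assumptions, not the kernel fault path:
    the toy_* names are hypothetical, the lock is a plain flag, and the
    model allows exactly one retry per fault.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_result { TOY_DONE, TOY_RETRY };

struct toy_mm {
    bool sem_held;          /* stands in for holding mmap_sem for read */
    bool page_cached;       /* is the faulting page already in cache? */
    int  io_with_sem_held;  /* counts IO issued while the lock was held */
};

/* Pretend to do blocking IO and record whether the lock was held. */
static void toy_do_io(struct toy_mm *mm)
{
    if (mm->sem_held)
        mm->io_with_sem_held++;
    mm->page_cached = true;
}

/* One pass through the fault handler. */
static enum toy_result toy_fault(struct toy_mm *mm, bool allow_retry)
{
    if (mm->page_cached)
        return TOY_DONE;
    if (allow_retry) {
        mm->sem_held = false;  /* drop the lock before blocking on IO */
        toy_do_io(mm);
        return TOY_RETRY;      /* caller retakes the lock and retries */
    }
    toy_do_io(mm);             /* no retry allowed: IO under the lock */
    return TOY_DONE;
}

/* Caller loop: take the lock, fault, retry once if asked. Returns passes. */
static int toy_handle_fault(struct toy_mm *mm)
{
    bool allow_retry = true;
    int passes = 0;

    for (;;) {
        mm->sem_held = true;
        passes++;
        if (toy_fault(mm, allow_retry) == TOY_DONE) {
            mm->sem_held = false;
            return passes;
        }
        allow_retry = false;   /* assumption: only one retry per fault */
    }
}
```

    With a cold page, toy_handle_fault() takes two passes and never does
    IO with the lock held; with a warm page it finishes in one pass.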
    
    The other big thing is ->page_mkwrite.  btrfs is particularly shitty here
    because we have to reserve space for the dirty page, which can be a very
    expensive operation.  We use the same retry method as the read path, and
    simply cache the page and verify the page is still set up properly the next
    pass through ->page_mkwrite().
    
    I've tested these patches with xfstests and there are no regressions.
    
    This patch (of 3):
    
    If we do not have a page at filemap_fault time we'll do this weird forced
    page_cache_read thing to populate the page, and then drop it again and
    loop around and find it.  This makes for 2 ways we can read a page in
    filemap_fault, and it's not really needed.  Instead add a FGP_FOR_MMAP
    flag so that pagecache_get_page() will return an unlocked page that's in
    pagecache.  Then use the normal page locking and readpage logic already in
    filemap_fault.  This simplifies the no-page-in-page-cache case
    significantly.
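
    A rough userspace sketch of the resulting control flow is shown below.
    It is an illustration only and not the code in this patch: the toy_*
    functions and TOY_FGP_* flags are hypothetical stand-ins, and the page
    cache is reduced to a single slot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_FGP_CREAT    0x1u  /* create the page if it is missing */
#define TOY_FGP_FOR_MMAP 0x2u  /* hand a newly created page back unlocked */

struct toy_page {
    bool present;   /* page is in the (one-slot) page cache */
    bool locked;
    bool uptodate;  /* contents have been read from disk */
};

/* Lookup that can also create the page, modelled on the commit text. */
static struct toy_page *toy_get_page(struct toy_page *slot, unsigned int flags)
{
    if (slot->present)
        return slot;            /* cached page: caller does the locking */
    if (!(flags & TOY_FGP_CREAT))
        return NULL;
    slot->present = true;
    slot->uptodate = false;
    slot->locked = true;        /* assumption: insertion locks the page */
    if (flags & TOY_FGP_FOR_MMAP)
        slot->locked = false;   /* mmap path expects an unlocked page */
    return slot;
}

/* Single path for both cached and missing pages; returns 0 on success. */
static int toy_filemap_fault(struct toy_page *slot, int *reads)
{
    struct toy_page *page;

    page = toy_get_page(slot, TOY_FGP_CREAT | TOY_FGP_FOR_MMAP);
    if (!page)
        return -1;
    page->locked = true;        /* the normal page locking step */
    if (!page->uptodate) {      /* the normal readpage step */
        (*reads)++;
        page->uptodate = true;
    }
    return 0;                   /* page is returned locked */
}
```

    The point of the sketch is that a missing page and a cached page go
    through the same lock-then-read steps, so there is only one place
    where a read is issued.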
    
    [akpm@linux-foundation.org: fix comment text]
    [josef@toxicpanda.com: don't unlock null page in FGP_FOR_MMAP case]
      Link: http://lkml.kernel.org/r/20190312201742.22935-1-josef@toxicpanda.com
    Link: http://lkml.kernel.org/r/20181211173801.29535-2-josef@toxicpanda.com
    
    
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>