• John Hubbard's avatar
    mm: introduce put_user_page*(), placeholder versions · fc1d8e7c
    John Hubbard authored
    A discussion of the overall problem is below.
    
    As mentioned in patch 0001, the steps are to fix the problem are:
    
    1) Provide put_user_page*() routines, intended to be used
       for releasing pages that were pinned via get_user_pages*().
    
    2) Convert all of the call sites for get_user_pages*(), to
       invoke put_user_page*(), instead of put_page(). This involves dozens of
       call sites, and will take some time.
    
    3) After (2) is complete, use get_user_pages*() and put_user_page*() to
       implement tracking of these pages. This tracking will be separate from
       the existing struct page refcounting.
    
    4) Use the tracking and identification of these pages, to implement
       special handling (especially in writeback paths) when the pages are
       backed by a filesystem.
    
    Overview
    ========
    
    Some kernel components (file systems, device drivers) need to access
    memory that is specified via process virtual address.  For a long time,
    the API to achieve that was get_user_pages ("GUP") and its variations.
    However, GUP has critical limitations that have been overlooked; in
    particular, GUP does not interact correctly with filesystems in all
    situations.  That means that file-backed memory + GUP is a recipe for
    potential problems, some of which have already occurred in the field.
    
    GUP was first introduced for Direct IO (O_DIRECT), allowing filesystem
    code to get the struct page behind a virtual address and to let storage
    hardware perform a direct copy to or from that page.  This is a
    short-lived access pattern, and as such, the window for a concurrent
    writeback of GUP'd page was small enough that there were not (we think)
    any reported problems.  Also, userspace was expected to understand and
    accept that Direct IO was not synchronized with memory-mapped access to
    that data, nor with any process address space changes such as munmap(),
    mremap(), etc.
    
    Over the years, more GUP uses have appeared (virtualization, device
    drivers, RDMA) that can keep the pages they get via GUP for a long period
    of time (seconds, minutes, hours, days, ...).  This long-term pinning
    makes an underlying design problem more obvious.
    
    In fact, there are a number of key problems inherent to GUP:
    
    Interactions with file systems
    ==============================
    
    File systems expect to be able to write back data, both to reclaim pages,
    and for data integrity.  Allowing other hardware (NICs, GPUs, etc) to gain
    write access to the file memory pages means that such hardware can dirty
    the pages, without the filesystem being aware.  This can, in some cases
    (depending on filesystem, filesystem options, block device, block device
    options, and other variables), lead to data corruption, and also to kernel
    bugs of the form:
    
        kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
        backtrace:
            ext4_writepage
            __writepage
            write_cache_pages
            ext4_writepages
            do_writepages
            __writeback_single_inode
            writeback_sb_inodes
            __writeback_inodes_wb
            wb_writeback
            wb_workfn
            process_one_work
            worker_thread
            kthread
            ret_from_fork
    
    ...which is due to the file system asserting that there are still buffer
    heads attached:
    
            ({                                                      \
                    BUG_ON(!PagePrivate(page));                     \
                    ((struct buffer_head *)page_private(page));     \
            })
    
    Dave Chinner's description of this is very clear:
    
        "The fundamental issue is that ->page_mkwrite must be called on every
        write access to a clean file backed page, not just the first one.
        How long the GUP reference lasts is irrelevant, if the page is clean
        and you need to dirty it, you must call ->page_mkwrite before it is
        marked writeable and dirtied. Every. Time."
    
    This is just one symptom of the larger design problem: real filesystems
    that actually write to a backing device, do not actually support
    get_user_pages() being called on their pages, and letting hardware write
    directly to those pages--even though that pattern has been going on since
    about 2005 or so.
    
    Long term GUP
    =============
    
    Long term GUP is an issue when FOLL_WRITE is specified to GUP (so, a
    writeable mapping is created), and the pages are file-backed.  That can
    lead to filesystem corruption.  What happens is that when a file-backed
    page is being written back, it is first mapped read-only in all of the CPU
    page tables; the file system then assumes that nobody can write to the
    page, and that the page content is therefore stable.  Unfortunately, the
    GUP callers generally do not monitor changes to the CPU pages tables; they
    instead assume that the following pattern is safe (it's not):
    
        get_user_pages()
    
        Hardware can keep a reference to those pages for a very long time,
        and write to it at any time.  Because "hardware" here means "devices
        that are not a CPU", this activity occurs without any interaction with
        the kernel's file system code.
    
        for each page
            set_page_dirty
            put_page()
    
    In fact, the GUP documentation even recommends that pattern.
    
    Anyway, the file system assumes that the page is stable (nothing is
    writing to the page), and that is a problem: stable page content is
    necessary for many filesystem actions during writeback, such as checksum,
    encryption, RAID striping, etc.  Furthermore, filesystem features like COW
    (copy on write) or snapshot also rely on being able to use a new page for
    as memory for that memory range inside the file.
    
    Corruption during write back is clearly possible here.  To solve that, one
    idea is to identify pages that have active GUP, so that we can use a
    bounce page to write stable data to the filesystem.  The filesystem would
    work on the bounce page, while any of the active GUP might write to the
    original page.  This would avoid the stable page violation problem, but
    note that it is only part of the overall solution, because other problems
    remain.
    
    Other filesystem features that need to replace the page with a new one can
    be inhibited for pages that are GUP-pinned.  This will, however, alter and
    limit some of those filesystem features.  The only fix for that would be
    to require GUP users to monitor and respond to CPU page table updates.
    Subsystems such as ODP and HMM do this, for example.  This aspect of the
    problem is still under discussion.
    
    Direct IO
    =========
    
    Direct IO can cause corruption, if userspace does Direct-IO that writes to
    a range of virtual addresses that are mmap'd to a file.  The pages written
    to are file-backed pages that can be under write back, while the Direct IO
    is taking place.  Here, Direct IO races with a write back: it calls GUP
    before page_mkclean() has replaced the CPU pte with a read-only entry.
    The race window is pretty small, which is probably why years have gone by
    before we noticed this problem: Direct IO is generally very quick, and
    tends to finish up before the filesystem gets around to do anything with
    the page contents.  However, it's still a real problem.  The solution is
    to never let GUP return pages that are under write back, but instead,
    force GUP to take a write fault on those pages.  That way, GUP will
    properly synchronize with the active write back.  This does not change the
    required GUP behavior, it just avoids that race.
    
    Details
    =======
    
    Introduces put_user_page(), which simply calls put_page().  This provides
    a way to update all get_user_pages*() callers, so that they call
    put_user_page(), instead of put_page().
    
    Also introduces put_user_pages(), and a few dirty/locked variations, as a
    replacement for release_pages(), and also as a replacement for open-coded
    loops that release multiple pages.  These may be used for subsequent
    performance improvements, via batching of pages to be released.
    
    This is the first step of fixing a problem (also described in [1] and [2])
    with interactions between get_user_pages ("gup") and filesystems.
    
    Problem description: let's start with a bug report.  Below, is what
    happens sometimes, under memory pressure, when a driver pins some pages
    via gup, and then marks those pages dirty, and releases them.  Note that
    the gup documentation actually recommends that pattern.  The problem is
    that the filesystem may do a writeback while the pages were gup-pinned,
    and then the filesystem believes that the pages are clean.  So, when the
    driver later marks the pages as dirty, that conflicts with the
    filesystem's page tracking and results in a BUG(), like this one that I
    experienced:
    
        kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
        backtrace:
            ext4_writepage
            __writepage
            write_cache_pages
            ext4_writepages
            do_writepages
            __writeback_single_inode
            writeback_sb_inodes
            __writeback_inodes_wb
            wb_writeback
            wb_workfn
            process_one_work
            worker_thread
            kthread
            ret_from_fork
    
    ...which is due to the file system asserting that there are still buffer
    heads attached:
    
            ({                                                      \
                    BUG_ON(!PagePrivate(page));                     \
                    ((struct buffer_head *)page_private(page));     \
            })
    
    Dave Chinner's description of this is very clear:
    
        "The fundamental issue is that ->page_mkwrite must be called on
        every write access to a clean file backed page, not just the first
        one.  How long the GUP reference lasts is irrelevant, if the page is
        clean and you need to dirty it, you must call ->page_mkwrite before it
        is marked writeable and dirtied.  Every.  Time."
    
    This is just one symptom of the larger design problem: real filesystems
    that actually write to a backing device, do not actually support
    get_user_pages() being called on their pages, and letting hardware write
    directly to those pages--even though that pattern has been going on since
    about 2005 or so.
    
    The steps are to fix it are:
    
    1) (This patch): provide put_user_page*() routines, intended to be used
       for releasing pages that were pinned via get_user_pages*().
    
    2) Convert all of the call sites for get_user_pages*(), to
       invoke put_user_page*(), instead of put_page(). This involves dozens of
       call sites, and will take some time.
    
    3) After (2) is complete, use get_user_pages*() and put_user_page*() to
       implement tracking of these pages. This tracking will be separate from
       the existing struct page refcounting.
    
    4) Use the tracking and identification of these pages, to implement
       special handling (especially in writeback paths) when the pages are
       backed by a filesystem.
    
    [1] https://lwn.net/Articles/774411/ : "DMA and get_user_pages()"
    [2] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
    
    Link: http://lkml.kernel.org/r/20190327023632.13307-2-jhubbard@nvidia.comSigned-off-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
    Reviewed-by: default avatarJan Kara <jack@suse.cz>
    Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>		[docs]
    Reviewed-by: default avatarIra Weiny <ira.weiny@intel.com>
    Reviewed-by: default avatarJérôme Glisse <jglisse@redhat.com>
    Reviewed-by: default avatarChristoph Lameter <cl@linux.com>
    Tested-by: default avatarIra Weiny <ira.weiny@intel.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
    Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    fc1d8e7c
mm.h 88.1 KB