• Eryu Guan's avatar
    xfs: truncate pagecache before writeback in xfs_setattr_size() · 350976ae
    Eryu Guan authored
    On truncate down, if new size is not block size aligned, we zero the
    rest of block to avoid exposing stale data to user, and
    iomap_truncate_page() skips zeroing if the range is already in
    unwritten state or a hole. Then we writeback from on-disk i_size to
    the new size if this range hasn't been written to disk yet, and
    truncate page cache beyond new EOF and set in-core i_size.
    
    The problem is that we could write data between di_size and newsize
    before removing the page cache beyond newsize, as the extents may
    still be in unwritten state right after a buffer write. As such, the
    page of data that newsize lies in has not been zeroed by page cache
    invalidation before it is written, and xfs_do_writepage() hasn't
    triggered it's "zero data beyond EOF" case because we haven't
    updated in-core i_size yet. Then a subsequent mmap read could see
    non-zeros past EOF.
    
    I occasionally see this in fsx runs in fstests generic/112, a
    simplified fsx operation sequence is like (assuming 4k block size
    xfs):
    
      fallocate 0x0 0x1000 0x0 keep_size
      write 0x0 0x1000 0x0
      truncate 0x0 0x800 0x1000
      punch_hole 0x0 0x800 0x800
      mapread 0x0 0x800 0x800
    
    where fallocate allocates unwritten extent but doesn't update
    i_size, buffer write populates the page cache and extent is still
    unwritten, truncate skips zeroing page past new EOF and writes the
    page to disk, punch_hole invalidates the page cache, at last mapread
    reads the block back and sees non-zero beyond EOF.
    
    Fix it by moving truncate_setsize() to before writeback so the page
    cache invalidation zeros the partial page at the new EOF. This also
    triggers "zero data beyond EOF" in xfs_do_writepage() at writeback
    time, because newsize has been set and page straddles the newsize.
    
    Also fixed the wrong 'end' param of filemap_write_and_wait_range()
    call while we're at it, the 'end' is inclusive and should be
    'newsize - 1'.
    Suggested-by: 's avatarDave Chinner <dchinner@redhat.com>
    Signed-off-by: 's avatarEryu Guan <eguan@redhat.com>
    Acked-by: 's avatarDave Chinner <dchinner@redhat.com>
    Reviewed-by: 's avatarBrian Foster <bfoster@redhat.com>
    Reviewed-by: 's avatarDarrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: 's avatarDarrick J. Wong <darrick.wong@oracle.com>
    350976ae