Skip to content
  • Dennis Zhou's avatar
    btrfs: add the beginning of async discard, discard workqueue · b0643e59
    Dennis Zhou authored
    
    
    When discard is enabled, everytime a pinned extent is released back to
    the block_group's free space cache, a discard is issued for the extent.
    This is an overeager approach when it comes to discarding and helping
    the SSD maintain enough free space to prevent severe garbage collection
    situations.
    
    This adds the beginning of async discard. Instead of issuing a discard
    prior to returning it to the free space, it is just marked as untrimmed.
    The block_group is then added to a LRU which then feeds into a workqueue
    to issue discards at a much slower rate. Full discarding of unused block
    groups is still done and will be addressed in a future patch of the
    series.
    
    For now, we don't persist the discard state of extents and bitmaps.
    Therefore, our failure recovery mode will be to consider extents
    untrimmed. This lets us handle failure and unmounting as one in the
    same.
    
    On a number of Facebook webservers, I collected data every minute
    accounting the time we spent in btrfs_finish_extent_commit() (col. 1)
    and in btrfs_commit_transaction() (col. 2). btrfs_finish_extent_commit()
    is where we discard extents synchronously before returning them to the
    free space cache.
    
    discard=sync:
                     p99 total per minute       p99 total per minute
          Drive   |   extent_commit() (ms)  |    commit_trans() (ms)
        ---------------------------------------------------------------
         Drive A  |           434           |          1170
         Drive B  |           880           |          2330
         Drive C  |          2943           |          3920
         Drive D  |          4763           |          5701
    
    discard=async:
                     p99 total per minute       p99 total per minute
          Drive   |   extent_commit() (ms)  |    commit_trans() (ms)
        --------------------------------------------------------------
         Drive A  |           134           |           956
         Drive B  |            64           |          1972
         Drive C  |            59           |          1032
         Drive D  |            62           |          1200
    
    While it's not great that the stats are cumulative over 1m, all of these
    servers are running the same workload and and the delta between the two
    are substantial. We are spending significantly less time in
    btrfs_finish_extent_commit() which is responsible for discarding.
    
    Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
    Signed-off-by: default avatarDennis Zhou <dennis@kernel.org>
    Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
    Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
    b0643e59