• Aaron Lu's avatar
    mm, swap: use rbtree for swap_extent · 4efaceb1
    Aaron Lu authored
    swap_extent is used to map swap page offset to backing device's block
    offset.  For a continuous block range, one swap_extent is used and all
    these swap_extents are managed in a linked list.
    
    These swap_extents are used by map_swap_entry() during swap's read and
    write path.  To find out the backing device's block offset for a page
    offset, the swap_extent list will be traversed linearly, with
    curr_swap_extent being used as a cache to speed up the search.
    
    This works well as long as swap_extents are not huge or when the number
    of processes that access swap device are few, but when the swap device
    has many extents and there are a number of processes accessing the swap
    device concurrently, it can be a problem.  On one of our servers, the
    disk's remaining size is tight:
    
      $df -h
      Filesystem      Size  Used Avail Use% Mounted on
      ... ...
      /dev/nvme0n1p1  1.8T  1.3T  504G  72% /home/t4
    
    When creating a 80G swapfile there, there are as many as 84656 swap
    extents.  The end result is, kernel spends abou 30% time in
    map_swap_entry() and swap throughput is only 70MB/s.
    
    As a comparison, when I used smaller sized swapfile, like 4G whose
    swap_extent dropped to 2000, swap throughput is back to 400-500MB/s and
    map_swap_entry() is about 3%.
    
    One downside of using rbtree for swap_extent is, 'struct rbtree' takes
    24 bytes while 'struct list_head' takes 16 bytes, that's 8 bytes more
    for each swap_extent.  For a swapfile that has 80k swap_extents, that
    means 625KiB more memory consumed.
    
    Test:
    
    Since it's not possible to reboot that server, I can not test this patch
    diretly there.  Instead, I tested it on another server with NVMe disk.
    
    I created a 20G swapfile on an NVMe backed XFS fs.  By default, the
    filesystem is quite clean and the created swapfile has only 2 extents.
    Testing vanilla and this patch shows no obvious performance difference
    when swapfile is not fragmented.
    
    To see the patch's effects, I used some tweaks to manually fragment the
    swapfile by breaking the extent at 1M boundary.  This made the swapfile
    have 20K extents.
    
      nr_task=4
      kernel   swapout(KB/s) map_swap_entry(perf)  swapin(KB/s) map_swap_entry(perf)
      vanilla  165191           90.77%             171798          90.21%
      patched  858993 +420%      2.16%             715827 +317%     0.77%
    
      nr_task=8
      kernel   swapout(KB/s) map_swap_entry(perf)  swapin(KB/s) map_swap_entry(perf)
      vanilla  306783           92.19%             318145          87.76%
      patched  954437 +211%      2.35%            1073741 +237%     1.57%
    
    swapout: the throughput of swap out, in KB/s, higher is better 1st
    map_swap_entry: cpu cycles percent sampled by perf swapin: the
    throughput of swap in, in KB/s, higher is better.  2nd map_swap_entry:
    cpu cycles percent sampled by perf
    
    nr_task=1 doesn't show any difference, this is due to the curr_swap_extent
    can be effectively used to cache the correct swap extent for single task
    workload.
    
    [akpm@linux-foundation.org: s/BUG_ON(1)/BUG()/]
    Link: http://lkml.kernel.org/r/20190523142404.GA181@aaronluSigned-off-by: default avatarAaron Lu <ziqian.lzq@antfin.com>
    Cc: Huang Ying <ying.huang@intel.com>
    Cc: Hugh Dickins <hughd@google.com>
    Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
    Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    4efaceb1