    writeback, cgroup: do not reparent dax inodes · 593311e8
    Roman Gushchin authored
    The inode switching code is not suited for dax inodes.  An attempt to
    switch a dax inode to a parent writeback structure (as a part of a
    writeback cleanup procedure) results in a panic like this:
    
      run fstests generic/270 at 2021-07-15 05:54:02
      XFS (pmem0p2): EXPERIMENTAL big timestamp feature in use.  Use at your own risk!
      XFS (pmem0p2): DAX enabled. Warning: EXPERIMENTAL, use at your own risk
      XFS (pmem0p2): EXPERIMENTAL inode btree counters feature in use. Use at your own risk!
      XFS (pmem0p2): Mounting V5 Filesystem
      XFS (pmem0p2): Ending clean mount
      XFS (pmem0p2): Quotacheck needed: Please wait.
      XFS (pmem0p2): Quotacheck: Done.
      XFS (pmem0p2): xlog_verify_grant_tail: space > BBTOB(tail_blocks)
      XFS (pmem0p2): xlog_verify_grant_tail: space > BBTOB(tail_blocks)
      XFS (pmem0p2): xlog_verify_grant_tail: space > BBTOB(tail_blocks)
      BUG: unable to handle page fault for address: 0000000005b0f669
      #PF: supervisor read access in kernel mode
      #PF: error_code(0x0000) - not-present page
      PGD 0 P4D 0
      Oops: 0000 [#1] SMP PTI
      CPU: 13 PID: 10479 Comm: kworker/13:16 Not tainted 5.14.0-rc1-master-8096acd7+ #8
      Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 09/13/2016
      Workqueue: inode_switch_wbs inode_switch_wbs_work_fn
      RIP: 0010:inode_do_switch_wbs+0xaf/0x470
      Code: 00 30 0f 85 c1 03 00 00 0f 1f 44 00 00 31 d2 48 c7 c6 ff ff ff ff 48 8d 7c 24 08 e8 eb 49 1a 00 48 85 c0 74 4a bb ff ff ff ff <48> 8b 50 08 48 8d 4a ff 83 e2 01 48 0f 45 c1 48 8b 00 a8 08 0f 85
      RSP: 0018:ffff9c66691abdc8 EFLAGS: 00010002
      RAX: 0000000005b0f661 RBX: 00000000ffffffff RCX: ffff89e6a21382b0
      RDX: 0000000000000001 RSI: ffff89e350230248 RDI: ffffffffffffffff
      RBP: ffff89e681d19400 R08: 0000000000000000 R09: 0000000000000228
      R10: ffffffffffffffff R11: ffffffffffffffc0 R12: ffff89e6a2138130
      R13: ffff89e316af7400 R14: ffff89e316af6e78 R15: ffff89e6a21382b0
      FS:  0000000000000000(0000) GS:ffff89ee5fb40000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000005b0f669 CR3: 0000000cb2410004 CR4: 00000000001706e0
      Call Trace:
       inode_switch_wbs_work_fn+0xb6/0x2a0
       process_one_work+0x1e6/0x380
       worker_thread+0x53/0x3d0
       kthread+0x10f/0x130
       ret_from_fork+0x22/0x30
      Modules linked in: xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_counter nf_tables nfnetlink bridge stp llc rfkill sunrpc intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel ipmi_ssif kvm mgag200 i2c_algo_bit iTCO_wdt irqbypass drm_kms_helper iTCO_vendor_support acpi_ipmi rapl syscopyarea sysfillrect intel_cstate ipmi_si sysimgblt ioatdma dax_pmem_compat fb_sys_fops ipmi_devintf device_dax i2c_i801 pcspkr intel_uncore hpilo nd_pmem cec dax_pmem_core dca i2c_smbus acpi_tad lpc_ich ipmi_msghandler acpi_power_meter drm fuse xfs libcrc32c sd_mod t10_pi crct10dif_pclmul crc32_pclmul crc32c_intel tg3 ghash_clmulni_intel serio_raw hpsa hpwdt scsi_transport_sas wmi dm_mirror dm_region_hash dm_log dm_mod
      CR2: 0000000005b0f669
      ---[ end trace ed2105faff8384f3 ]---
      RIP: 0010:inode_do_switch_wbs+0xaf/0x470
      Code: 00 30 0f 85 c1 03 00 00 0f 1f 44 00 00 31 d2 48 c7 c6 ff ff ff ff 48 8d 7c 24 08 e8 eb 49 1a 00 48 85 c0 74 4a bb ff ff ff ff <48> 8b 50 08 48 8d 4a ff 83 e2 01 48 0f 45 c1 48 8b 00 a8 08 0f 85
      RSP: 0018:ffff9c66691abdc8 EFLAGS: 00010002
      RAX: 0000000005b0f661 RBX: 00000000ffffffff RCX: ffff89e6a21382b0
      RDX: 0000000000000001 RSI: ffff89e350230248 RDI: ffffffffffffffff
      RBP: ffff89e681d19400 R08: 0000000000000000 R09: 0000000000000228
      R10: ffffffffffffffff R11: ffffffffffffffc0 R12: ffff89e6a2138130
      R13: ffff89e316af7400 R14: ffff89e316af6e78 R15: ffff89e6a21382b0
      FS:  0000000000000000(0000) GS:ffff89ee5fb40000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000005b0f669 CR3: 0000000cb2410004 CR4: 00000000001706e0
      Kernel panic - not syncing: Fatal exception
      Kernel Offset: 0x15200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
      ---[ end Kernel panic - not syncing: Fatal exception ]---
    
    The crash happens on an attempt to iterate over attached pagecache pages
    and check the dirty flag: a dax inode's xarray contains pfns instead of
    generic struct page pointers.
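
    The register dump above is consistent with this: RAX holds
    0000000005b0f661, whose low bit is set (as for an xarray value entry,
    unlike an aligned pointer), and the faulting address in CR2 is exactly
    RAX + 8.  A minimal userspace sketch of that arithmetic follows.  The
    helper names are hypothetical, and reading the faulting instruction as a
    load from offset 8 of the entry is an assumption based on the
    "<48> 8b 50 08" bytes in the Code: line.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper modelled on the kernel's xa_is_value(): xarray
 * value entries have bit 0 set, which an aligned object pointer never has.
 */
static bool entry_is_value(uint64_t entry)
{
	return (entry & 1) != 0;
}

/*
 * Hypothetical helper: the address that gets read if the entry is
 * mistaken for an object pointer and the word at offset 8 is loaded.
 */
static uint64_t load_address_at_offset8(uint64_t entry)
{
	return entry + 8;
}
```

    Feeding in the RAX value from the oops gives a value entry and a load
    address equal to the reported CR2, whereas a kernel pointer such as the
    one in RCX is not classified as a value entry.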
    
    This happens for DAX and not for other kinds of non-page entries in the
    inodes because it's a tagged iteration, and shadow/swap entries are
    never tagged; only DAX entries get tagged.
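
    The tagging argument can be illustrated with a toy model.  This is only
    a sketch: the slot layout, the enum, and the function name are
    hypothetical and do not correspond to the real xarray API.  It shows
    that a walk restricted to tagged slots can never return an untagged
    shadow entry, but does return a tagged DAX entry.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical classification of what a mapping slot may hold. */
enum entry_kind { ENTRY_PAGE, ENTRY_SHADOW, ENTRY_DAX };

/* Hypothetical slot: the entry kind plus a single "dirty" tag bit. */
struct slot {
	enum entry_kind kind;
	bool dirty_tag;
};

/*
 * Walk only the tagged slots and count how many of the visited entries
 * are not struct page pointers.  Untagged slots are skipped entirely,
 * so their contents never reach the caller.
 */
static size_t tagged_walk_nonpage(const struct slot *slots, size_t n)
{
	size_t hits = 0;

	for (size_t i = 0; i < n; i++) {
		if (!slots[i].dirty_tag)
			continue;	/* not part of a tagged iteration */
		if (slots[i].kind != ENTRY_PAGE)
			hits++;		/* caller would misread this entry */
	}
	return hits;
}
```

    In this model a regular pagecache mapping with untagged shadow entries
    yields zero non-page hits, while a mapping holding a tagged DAX entry
    yields one.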
    
    Fix the problem by bailing out (with the false return value) of
    inode_prepare_wbs_switch() if a dax inode is passed.
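
    The shape of the bail-out can be sketched in userspace as follows.  This
    is a simplified model, not the kernel code: struct model_inode and
    model_prepare_switch() are hypothetical, is_dax stands in for a check
    such as IS_DAX(inode), and other_checks_pass stands in for whatever
    preconditions the prepare step already had.

```c
#include <stdbool.h>

/* Hypothetical, heavily simplified stand-in for struct inode. */
struct model_inode {
	bool is_dax;		/* assumed equivalent of IS_DAX(inode) */
	bool other_checks_pass;	/* placeholder for pre-existing checks */
};

/*
 * Model of the prepare step: returning false means the caller must not
 * go on to switch this inode to another writeback structure.
 */
static bool model_prepare_switch(const struct model_inode *inode)
{
	/* DAX mappings hold pfn entries, not pages: refuse to switch. */
	if (inode->is_dax)
		return false;

	return inode->other_checks_pass;
}
```

    With this early return a dax inode never reaches the code that walks the
    mapping and checks page dirty flags, so the switch is skipped rather
    than attempted.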
    
    [willy@infradead.org: changelog addition]
    
    Link: https://lkml.kernel.org/r/20210719171350.3876830-1-guro@fb.com
    Fixes: c22d70a1 ("writeback, cgroup: release dying cgwbs by switching attached inodes")
    Signed-off-by: Roman Gushchin <guro@fb.com>
    Reported-by: Murphy Zhou <jencce.kernel@gmail.com>
    Reported-by: Darrick J. Wong <djwong@kernel.org>
    Tested-by: Darrick J. Wong <djwong@kernel.org>
    Tested-by: Murphy Zhou <jencce.kernel@gmail.com>
    Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>