• Eric Ren's avatar
    ocfs2: fix crash caused by stale lvb with fsdlm plugin · e7ee2c08
    Eric Ren authored
    The crash happens rather often when we reset some cluster nodes while
    nodes contend fiercely to do truncate and append.
    
    The crash backtrace is below:
    
       dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover_grant 1 locks on 971 resources
       dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover 9 generation 5 done: 4 ms
       ocfs2: Begin replay journal (node 318952601, slot 2) on device (253,18)
       ocfs2: End replay journal (node 318952601, slot 2) on device (253,18)
       ocfs2: Beginning quota recovery on device (253,18) for slot 2
       ocfs2: Finishing quota recovery on device (253,18) for slot 2
       (truncate,30154,1):ocfs2_truncate_file:470 ERROR: bug expression: le64_to_cpu(fe->i_size) != i_size_read(inode)
       (truncate,30154,1):ocfs2_truncate_file:470 ERROR: Inode 290321, inode i_size = 732 != di i_size = 937, i_flags = 0x1
       ------------[ cut here ]------------
       kernel BUG at /usr/src/linux/fs/ocfs2/file.c:470!
       invalid opcode: 0000 [#1] SMP
       Modules linked in: ocfs2_stack_user(OEN) ocfs2(OEN) ocfs2_nodemanager ocfs2_stackglue(OEN) quota_tree dlm(OEN) configfs fuse sd_mod    iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi af_packet iscsi_ibft iscsi_boot_sysfs softdog xfs libcrc32c ppdev parport_pc pcspkr parport      joydev virtio_balloon virtio_net i2c_piix4 acpi_cpufreq button processor ext4 crc16 jbd2 mbcache ata_generic cirrus virtio_blk ata_piix               drm_kms_helper ahci syscopyarea libahci sysfillrect sysimgblt fb_sys_fops ttm floppy libata drm virtio_pci virtio_ring uhci_hcd virtio ehci_hcd       usbcore serio_raw usb_common sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4
       Supported: No, Unsupported modules are loaded
       CPU: 1 PID: 30154 Comm: truncate Tainted: G           OE   N  4.4.21-69-default #1
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20151112_172657-sheep25 04/01/2014
       task: ffff88004ff6d240 ti: ffff880074e68000 task.ti: ffff880074e68000
       RIP: 0010:[<ffffffffa05c8c30>]  [<ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2]
       RSP: 0018:ffff880074e6bd50  EFLAGS: 00010282
       RAX: 0000000000000074 RBX: 000000000000029e RCX: 0000000000000000
       RDX: 0000000000000001 RSI: 0000000000000246 RDI: 0000000000000246
       RBP: ffff880074e6bda8 R08: 000000003675dc7a R09: ffffffff82013414
       R10: 0000000000034c50 R11: 0000000000000000 R12: ffff88003aab3448
       R13: 00000000000002dc R14: 0000000000046e11 R15: 0000000000000020
       FS:  00007f839f965700(0000) GS:ffff88007fc80000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
       CR2: 00007f839f97e000 CR3: 0000000036723000 CR4: 00000000000006e0
       Call Trace:
         ocfs2_setattr+0x698/0xa90 [ocfs2]
         notify_change+0x1ae/0x380
         do_truncate+0x5e/0x90
         do_sys_ftruncate.constprop.11+0x108/0x160
         entry_SYSCALL_64_fastpath+0x12/0x6d
       Code: 24 28 ba d6 01 00 00 48 c7 c6 30 43 62 a0 8b 41 2c 89 44 24 08 48 8b 41 20 48 c7 c1 78 a3 62 a0 48 89 04 24 31 c0 e8 a0 97 f9 ff <0f> 0b 3d 00 fe ff ff 0f 84 ab fd ff ff 83 f8 fc 0f 84 a2 fd ff
       RIP  [<ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2]
    
    It's because ocfs2_inode_lock() get us stale LVB in which the i_size is
    not equal to the disk i_size.  We mistakenly trust the LVB because the
    underlaying fsdlm dlm_lock() doesn't set lkb_sbflags with
    DLM_SBF_VALNOTVALID properly for us.  But, why?
    
    The current code tries to downconvert lock without DLM_LKF_VALBLK flag
    to tell o2cb don't update RSB's LVB if it's a PR->NULL conversion, even
    if the lock resource type needs LVB.  This is not the right way for
    fsdlm.
    
    The fsdlm plugin behaves different on DLM_LKF_VALBLK, it depends on
    DLM_LKF_VALBLK to decide if we care about the LVB in the LKB.  If
    DLM_LKF_VALBLK is not set, fsdlm will skip recovering RSB's LVB from
    this lkb and set the right DLM_SBF_VALNOTVALID appropriately when node
    failure happens.
    
    The following diagram briefly illustrates how this crash happens:
    
    RSB1 is inode metadata lock resource with LOCK_TYPE_USES_LVB;
    
    The 1st round:
    
                 Node1                                    Node2
    RSB1: PR
                                                      RSB1(master): NULL->EX
    ocfs2_downconvert_lock(PR->NULL, set_lvb==0)
      ocfs2_dlm_lock(no DLM_LKF_VALBLK)
    
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    
    dlm_lock(no DLM_LKF_VALBLK)
      convert_lock(overwrite lkb->lkb_exflags
                   with no DLM_LKF_VALBLK)
    
    RSB1: NULL                                        RSB1: EX
                                                      reset Node2
    dlm_recover_rsbs()
      recover_lvb()
    
    /* The LVB is not trustable if the node with EX fails and
     * no lock >= PR is left. We should set RSB_VALNOTVALID for RSB1.
     */
    
     if(!(kb_exflags & DLM_LKF_VALBLK)) /* This means we miss the chance to
               return;                   * to invalid the LVB here.
                                         */
    
    The 2nd round:
    
             Node 1                                Node2
    RSB1(become master from recovery)
    
    ocfs2_setattr()
      ocfs2_inode_lock(NULL->EX)
        /* dlm_lock() return the stale lvb without setting DLM_SBF_VALNOTVALID */
        ocfs2_meta_lvb_is_trustable() return 1 /* so we don't refresh inode from disk */
      ocfs2_truncate_file()
          mlog_bug_on_msg(disk isize != i_size_read(inode))  /* crash! */
    
    The fix is quite straightforward.  We keep to set DLM_LKF_VALBLK flag
    for dlm_lock() if the lock resource type needs LVB and the fsdlm plugin
    is uesed.
    
    Link: http://lkml.kernel.org/r/1481275846-6604-1-git-send-email-zren@suse.comSigned-off-by: 's avatarEric Ren <zren@suse.com>
    Reviewed-by: 's avatarJoseph Qi <jiangqi903@gmail.com>
    Cc: Mark Fasheh <mfasheh@versity.com>
    Cc: Joel Becker <jlbec@evilplan.org>
    Cc: Junxiao Bi <junxiao.bi@oracle.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
    Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
    e7ee2c08