Skip to content
  • Yonghong Song's avatar
    bpf: Fix a rcu_sched stall issue with bpf task/task_file iterator · e679654a
    Yonghong Song authored
    
    
    In our production system, we observed rcu stalls when
    'bpftool prog` is running.
      rcu: INFO: rcu_sched self-detected stall on CPU
      rcu: \x097-....: (20999 ticks this GP) idle=302/1/0x4000000000000000 softirq=1508852/1508852 fqs=4913
      \x09(t=21031 jiffies g=2534773 q=179750)
      NMI backtrace for cpu 7
      CPU: 7 PID: 184195 Comm: bpftool Kdump: loaded Tainted: G        W         5.8.0-00004-g68bfc7f8c1b4 #6
      Hardware name: Quanta Twin Lakes MP/Twin Lakes Passive MP, BIOS F09_3A17 05/03/2019
      Call Trace:
      <IRQ>
      dump_stack+0x57/0x70
      nmi_cpu_backtrace.cold+0x14/0x53
      ? lapic_can_unplug_cpu.cold+0x39/0x39
      nmi_trigger_cpumask_backtrace+0xb7/0xc7
      rcu_dump_cpu_stacks+0xa2/0xd0
      rcu_sched_clock_irq.cold+0x1ff/0x3d9
      ? tick_nohz_handler+0x100/0x100
      update_process_times+0x5b/0x90
      tick_sched_timer+0x5e/0xf0
      __hrtimer_run_queues+0x12a/0x2a0
      hrtimer_interrupt+0x10e/0x280
      __sysvec_apic_timer_interrupt+0x51/0xe0
      asm_call_on_stack+0xf/0x20
      </IRQ>
      sysvec_apic_timer_interrupt+0x6f/0x80
      asm_sysvec_apic_timer_interrupt+0x12/0x20
      RIP: 0010:task_file_seq_get_next+0x71/0x220
      Code: 00 00 8b 53 1c 49 8b 7d 00 89 d6 48 8b 47 20 44 8b 18 41 39 d3 76 75 48 8b 4f 20 8b 01 39 d0 76 61 41 89 d1 49 39 c1 48 19 c0 <48> 8b 49 08 21 d0 48 8d 04 c1 4c 8b 08 4d 85 c9 74 46 49 8b 41 38
      RSP: 0018:ffffc90006223e10 EFLAGS: 00000297
      RAX: ffffffffffffffff RBX: ffff888f0d172388 RCX: ffff888c8c07c1c0
      RDX: 00000000000f017b RSI: 00000000000f017b RDI: ffff888c254702c0
      RBP: ffffc90006223e68 R08: ffff888be2a1c140 R09: 00000000000f017b
      R10: 0000000000000002 R11: 0000000000100000 R12: ffff888f23c24118
      R13: ffffc90006223e60 R14: ffffffff828509a0 R15: 00000000ffffffff
      task_file_seq_next+0x52/0xa0
      bpf_seq_read+0xb9/0x320
      vfs_read+0x9d/0x180
      ksys_read+0x5f/0xe0
      do_syscall_64+0x38/0x60
      entry_SYSCALL_64_after_hwframe+0x44/0xa9
      RIP: 0033:0x7f8815f4f76e
      Code: c0 e9 f6 fe ff ff 55 48 8d 3d 76 70 0a 00 48 89 e5 e8 36 06 02 00 66 0f 1f 44 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 52 c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5
      RSP: 002b:00007fff8f9df578 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
      RAX: ffffffffffffffda RBX: 000000000170b9c0 RCX: 00007f8815f4f76e
      RDX: 0000000000001000 RSI: 00007fff8f9df5b0 RDI: 0000000000000007
      RBP: 00007fff8f9e05f0 R08: 0000000000000049 R09: 0000000000000010
      R10: 00007f881601fa40 R11: 0000000000000246 R12: 00007fff8f9e05a8
      R13: 00007fff8f9e05a8 R14: 0000000001917f90 R15: 000000000000e22e
    
    Note that `bpftool prog` actually calls a task_file bpf iterator
    program to establish an association between prog/map/link/btf anon
    files and processes.
    
    In the case where the above rcu stall occured, we had a process
    having 1587 tasks and each task having roughly 81305 files.
    This implied 129 million bpf prog invocations. Unfortunwtely none of
    these files are prog/map/link/btf files so bpf iterator/prog needs
    to traverse all these files and not able to return to user space
    since there are no seq_file buffer overflow.
    
    This patch fixed the issue in bpf_seq_read() to limit the number
    of visited objects. If the maximum number of visited objects is
    reached, no more objects will be visited in the current syscall.
    If there is nothing written in the seq_file buffer, -EAGAIN will
    return to the user so user can try again.
    
    The maximum number of visited objects is set at 1 million.
    In our Intel Xeon D-2191 2.3GHZ 18-core server, bpf_seq_read()
    visiting 1 million files takes around 0.18 seconds.
    
    We did not use cond_resched() since for some iterators, e.g.,
    netlink iterator, where rcu read_lock critical section spans between
    consecutive seq_ops->next(), which makes impossible to do cond_resched()
    in the key while loop of function bpf_seq_read().
    
    Signed-off-by: default avatarYonghong Song <yhs@fb.com>
    Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
    Cc: Paul E. McKenney <paulmck@kernel.org>
    Link: https://lore.kernel.org/bpf/20200818222309.2181348-1-yhs@fb.com
    e679654a