1. 10 Feb, 2017 1 commit
    • Kees Cook's avatar
      time: Remove CONFIG_TIMER_STATS · dfb4357d
      Kees Cook authored
      Currently CONFIG_TIMER_STATS exposes process information across namespaces:
      kernel/time/timer_list.c print_timer():
              SEQ_printf(m, ", %s/%d", tmp, timer->start_pid);
       #11: <0000000000000000>, hrtimer_wakeup, S:01, do_nanosleep, cron/2570
      Given that the tracer can give the same information, this patch entirely
      removes CONFIG_TIMER_STATS.
      Suggested-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Acked-by: default avatarJohn Stultz <john.stultz@linaro.org>
      Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
      Cc: linux-doc@vger.kernel.org
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Xing Gao <xgao01@email.wm.edu>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Jessica Frazelle <me@jessfraz.com>
      Cc: kernel-hardening@lists.openwall.com
      Cc: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Michal Marek <mmarek@suse.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Olof Johansson <olof@lixom.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: linux-api@vger.kernel.org
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Link: http://lkml.kernel.org/r/20170208192659.GA32582@beast
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
  2. 19 Oct, 2016 1 commit
    • Tejun Heo's avatar
      workqueue: move wq_numa_init() to workqueue_init() · 2186d9f9
      Tejun Heo authored
      While splitting up workqueue initialization into two parts,
      ac8f73400782 ("workqueue: make workqueue available early during boot")
      put wq_numa_init() into workqueue_init_early().  Unfortunately, on
      some archs including power and arm64, cpu to node mapping isn't yet
      established by the time the early init is called leading to incorrect
      NUMA initialization and subsequently the following oops due to zero
      cpumask on node-specific unbound pools.
        Unable to handle kernel paging request for data at address 0x00000038
        Faulting instruction address: 0xc0000000000fc0cc
        Oops: Kernel access of bad area, sig: 11 [#1]
        SMP NR_CPUS=2048 NUMA PowerNV
        Modules linked in:
        CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.8.0-compiler_gcc-6.2.0-next-20161005 #94
        task: c0000007f5400000 task.stack: c000001ffc084000
        NIP: c0000000000fc0cc LR: c0000000000ed928 CTR: c0000000000fbfd0
        REGS: c000001ffc087780 TRAP: 0300   Not tainted  (4.8.0-compiler_gcc-6.2.0-next-20161005)
        MSR: 9000000002009033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE>  CR: 48000424  XER: 00000000
        CFAR: c0000000000089dc DAR: 0000000000000038 DSISR: 40000000 SOFTE: 0
        GPR00: c0000000000ed928 c000001ffc087a00 c000000000e63200 c000000010d6d600
        GPR04: c0000007f5409200 0000000000000021 000000000748e08c 000000000000001f
        GPR08: 0000000000000000 0000000000000021 000000000748f1f8 0000000000000000
        GPR12: 0000000028000422 c00000000fb80000 c00000000000e0c8 0000000000000000
        GPR16: 0000000000000000 0000000000000000 0000000000000021 0000000000000001
        GPR20: ffffffffafb50401 0000000000000000 c000000010d6d600 000000000000ba7e
        GPR24: 000000000000ba7e c000000000d8bc58 afb504000afb5041 0000000000000001
        GPR28: 0000000000000000 0000000000000004 c0000007f5409280 0000000000000000
        NIP [c0000000000fc0cc] enqueue_task_fair+0xfc/0x18b0
        LR [c0000000000ed928] activate_task+0x78/0xe0
        Call Trace:
        [c000001ffc087a00] [c0000007f5409200] 0xc0000007f5409200 (unreliable)
        [c000001ffc087b10] [c0000000000ed928] activate_task+0x78/0xe0
        [c000001ffc087b50] [c0000000000ede58] ttwu_do_activate+0x68/0xc0
        [c000001ffc087b90] [c0000000000ef1b8] try_to_wake_up+0x208/0x4f0
        [c000001ffc087c10] [c0000000000d3484] create_worker+0x144/0x250
        [c000001ffc087cb0] [c000000000cd72d0] workqueue_init+0x124/0x150
        [c000001ffc087d00] [c000000000cc0e74] kernel_init_freeable+0x158/0x360
        [c000001ffc087dc0] [c00000000000e0e4] kernel_init+0x24/0x160
        [c000001ffc087e30] [c00000000000bfa0] ret_from_kernel_thread+0x5c/0xbc
        Instruction dump:
        62940401 3b800000 3aa00000 7f17c378 3a600001 3b600001 60000000 60000000
        60420000 72490021 ebfe0150 2f890001 <ebbf0038> 419e0de0 7fbee840 419e0e58
        ---[ end trace 0000000000000000 ]---
      Fix it by moving wq_numa_init() to workqueue_init().  As this means
      that the early intialization may not have full NUMA info for per-cpu
      pools and ignores NUMA affinity for unbound pools, fix them up from
      workqueue_init() after wq_numa_init().
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: http://lkml.kernel.org/r/87twck5wqo.fsf@concordia.ellerman.id.au
      Fixes: ac8f73400782 ("workqueue: make workqueue available early during boot")
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
  3. 11 Oct, 2016 1 commit
  4. 17 Sep, 2016 2 commits
    • Tejun Heo's avatar
      workqueue: remove keventd_up() · 863b710b
      Tejun Heo authored
      keventd_up() no longer has in-kernel users.  Remove it and make
      wq_online static.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
    • Tejun Heo's avatar
      workqueue: make workqueue available early during boot · 3347fa09
      Tejun Heo authored
      Workqueue is currently initialized in an early init call; however,
      there are cases where early boot code has to be split and reordered to
      come after workqueue initialization or the same code path which makes
      use of workqueues is used both before workqueue initailization and
      after.  The latter cases have to gate workqueue usages with
      keventd_up() tests, which is nasty and easy to get wrong.
      Workqueue usages have become widespread and it'd be a lot more
      convenient if it can be used very early from boot.  This patch splits
      workqueue initialization into two steps.  workqueue_init_early() which
      sets up the basic data structures so that workqueues can be created
      and work items queued, and workqueue_init() which actually brings up
      workqueues online and starts executing queued work items.  The former
      step can be done very early during boot once memory allocation,
      cpumasks and idr are initialized.  The latter right after kthreads
      become available.
      This allows work item queueing and canceling from very early boot
      which is what most of these use cases want.
      * As systemd_wq being initialized doesn't indicate that workqueue is
        fully online anymore, update keventd_up() to test wq_online instead.
        The follow-up patches will get rid of all its usages and the
        function itself.
      * Flushing doesn't make sense before workqueue is fully initialized.
        The flush functions trigger WARN and return immediately before fully
      * Work items are never in-flight before fully online.  Canceling can
        always succeed by skipping the flush step.
      * Some code paths can no longer assume to be called with irq enabled
        as irq is disabled during early boot.  Use irqsave/restore
        operations instead.
      v2: Watchdog init, which requires timer to be running, moved from
          workqueue_init_early() to workqueue_init().
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Suggested-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/CA+55aFx0vPuMuxn00rBSM192n-Du5uxy+4AvKa0SBSOVJeuCGg@mail.gmail.com
  5. 16 Sep, 2016 1 commit
  6. 29 Aug, 2016 1 commit
  7. 14 Jul, 2016 1 commit
  8. 01 Jul, 2016 1 commit
  9. 16 Jun, 2016 1 commit
    • Peter Zijlstra's avatar
      workqueue: Fix setting affinity of unbound worker threads · d945b5e9
      Peter Zijlstra authored
      With commit e9d867a6
       ("sched: Allow per-cpu kernel threads to
      run on online && !active"), __set_cpus_allowed_ptr() expects that only
      strict per-cpu kernel threads can have affinity to an online CPU which
      is not yet active.
      This assumption is currently broken in the CPU_ONLINE notification
      handler for the workqueues where restore_unbound_workers_cpumask()
      calls set_cpus_allowed_ptr() when the first cpu in the unbound
      worker's pool->attr->cpumask comes online. Since
      set_cpus_allowed_ptr() is called with pool->attr->cpumask in which
      only one CPU is online which is not yet active, we get the following
      WARN_ON during an CPU online operation.
      ------------[ cut here ]------------
      WARNING: CPU: 40 PID: 248 at kernel/sched/core.c:1166
      Modules linked in:
      CPU: 40 PID: 248 Comm: cpuhp/40 Not tainted 4.6.0-autotest+ #4
      Call Trace:
      [c000000f273ff920] [c00000000010493c] __set_cpus_allowed_ptr+0x2cc/0x2e0 (unreliable)
      [c000000f273ffac0] [c0000000000ed4b0] workqueue_cpu_up_callback+0x2c0/0x470
      [c000000f273ffb70] [c0000000000f5c58] notifier_call_chain+0x98/0x100
      [c000000f273ffbc0] [c0000000000c5ed0] __cpu_notify+0x70/0xe0
      [c000000f273ffc00] [c0000000000c6028] notify_online+0x38/0x50
      [c000000f273ffc30] [c0000000000c5214] cpuhp_invoke_callback+0x84/0x250
      [c000000f273ffc90] [c0000000000c562c] cpuhp_up_callbacks+0x5c/0x120
      [c000000f273ffce0] [c0000000000c64d4] cpuhp_thread_fun+0x184/0x1c0
      [c000000f273ffd20] [c0000000000fa050] smpboot_thread_fn+0x290/0x2a0
      [c000000f273ffd80] [c0000000000f45b0] kthread+0x110/0x130
      [c000000f273ffe30] [c000000000009570] ret_from_kernel_thread+0x5c/0x6c
      ---[ end trace 00f1456578b2a3b2 ]---
      This patch fixes this by limiting the mask to the intersection of
      the pool affinity and online CPUs.
      Changelog-cribbed-from: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Reported-by: default avatarAbdul Haleem <abdhalee@linux.vnet.ibm.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
  10. 20 May, 2016 2 commits
    • Du, Changbin's avatar
      debugobjects: insulate non-fixup logic related to static obj from fixup callbacks · b9fdac7f
      Du, Changbin authored
      When activating a static object we need make sure that the object is
      tracked in the object tracker.  If it is a non-static object then the
      activation is illegal.
      In previous implementation, each subsystem need take care of this in
      their fixup callbacks.  Actually we can put it into debugobjects core.
      Thus we can save duplicated code, and have *pure* fixup callbacks.
      To achieve this, a new callback "is_static_object" is introduced to let
      the type specific code decide whether a object is static or not.  If
      yes, we take it into object tracker, otherwise give warning and invoke
      fixup callback.
      This change has paassed debugobjects selftest, and I also do some test
      with all debugobjects supports enabled.
      At last, I have a concern about the fixups that can it change the object
      which is in incorrect state on fixup? Because the 'addr' may not point
      to any valid object if a non-static object is not tracked.  Then Change
      such objec...
    • Du, Changbin's avatar
      workqueue: update debugobjects fixup callbacks return type · 02a982a6
      Du, Changbin authored
      Update the return type to use bool instead of int, corresponding to
      change (debugobjects: make fixup functions return bool instead of int)
      Signed-off-by: default avatarDu, Changbin <changbin.du@intel.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Josh Triplett <josh@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  11. 12 May, 2016 1 commit
    • Wanpeng Li's avatar
      workqueue: fix rebind bound workers warning · f7c17d26
      Wanpeng Li authored
      ------------[ cut here ]------------
      WARNING: CPU: 0 PID: 16 at kernel/workqueue.c:4559 rebind_workers+0x1c0/0x1d0
      Modules linked in:
      CPU: 0 PID: 16 Comm: cpuhp/0 Not tainted 4.6.0-rc4+ #31
      Hardware name: IBM IBM System x3550 M4 Server -[7914IUW]-/00Y8603, BIOS -[D7E128FUS-1.40]- 07/23/2013
       0000000000000000 ffff881037babb58 ffffffff8139d885 0000000000000010
       0000000000000000 0000000000000000 0000000000000000 ffff881037babba8
       ffffffff8108505d ffff881037ba0000 000011cf3e7d6e60 0000000000000046
      Call Trace:
       ? trace_hardirqs_on_caller+0xf2/0x220
       ? notify_prepare+0x80/0x80
       ? notify_prepare+0x80/0x80
       ? __schedule+0x33e/0x8a0
       ? smpboot_create_threads+0x80/0x80
       ? wait_for_completion+0xf0/0x120
       ? schedule_tail+0x35/0xf0
       ? __init_kthread_worker+0x70/0x70
      ---[ end trace eb12ae47d2382d8f ]---
      notify_down_prepare: attempt to take down CPU 0 failed
      This bug can be reproduced by below config w/ nohz_full= all cpus:
      As Thomas pointed out:
      | If a down prepare callback fails, then DOWN_FAILED is invoked for all
      | callbacks which have successfully executed DOWN_PREPARE.
      | But, workqueue has actually two notifiers. One which handles
      | UP/DOWN_FAILED/ONLINE and one which handles DOWN_PREPARE.
      | Now look at the priorities of those callbacks:
      | CPU_PRI_WORKQUEUE_UP        = 5
      | CPU_PRI_WORKQUEUE_DOWN      = -5
      | So the call order on DOWN_PREPARE is:
      | CB 1
      | CB ...
      | CB workqueue_up() -> Ignores DOWN_PREPARE
      | CB ...
      | CB X ---> Fails
      | So we call up to CB X with DOWN_FAILED
      | CB 1
      | CB ...
      | CB workqueue_up() -> Handles DOWN_FAILED
      | CB ...
      | CB X-1
      | So the problem is that the workqueue stuff handles DOWN_FAILED in the up
      | callback, while it should do it in the down callback. Which is not a good idea
      | either because it wants to be called early on rollback...
      | Brilliant stuff, isn't it? The hotplug rework will solve this problem because
      | the callbacks become symetric, but for the existing mess, we need some
      | workaround in the workqueue code.
      The boot CPU handles housekeeping duty(unbound timers, workqueues,
      timekeeping, ...) on behalf of full dynticks CPUs. It must remain
      online when nohz full is enabled. There is a priority set to every
      workqueue_cpu_up > tick_nohz_cpu_down > workqueue_cpu_down
      So tick_nohz_cpu_down callback failed when down prepare cpu 0, and
      notifier_blocks behind tick_nohz_cpu_down will not be called any
      more, which leads to workers are actually not unbound. Then hotplug
      state machine will fallback to undo and online cpu 0 again. Workers
      will be rebound unconditionally even if they are not unbound and
      trigger the warning in this progress.
      This patch fix it by catching !DISASSOCIATED to avoid rebind bound
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Frédéric Weisbecker <fweisbec@gmail.com>
      Cc: stable@vger.kernel.org
      Suggested-by: default avatarLai Jiangshan <jiangshanlai@gmail.com>
      Signed-off-by: default avatarWanpeng Li <wanpeng.li@hotmail.com>
  12. 26 Apr, 2016 1 commit
    • Roman Pen's avatar
      workqueue: fix ghost PENDING flag while doing MQ IO · 346c09f8
      Roman Pen authored
      The bug in a workqueue leads to a stalled IO request in MQ ctx->rq_list
      with the following backtrace:
      [  601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds.
      [  601.347574]       Tainted: G           O    4.4.5-1-storage+ #6
      [  601.347651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  601.348142] kworker/u129:5  D ffff880803077988     0  1636      2 0x00000000
      [  601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server]
      [  601.348999]  ffff880803077988 ffff88080466b900 ffff8808033f9c80 ffff880803078000
      [  601.349662]  ffff880807c95000 7fffffffffffffff ffffffff815b0920 ffff880803077ad0
      [  601.350333]  ffff8808030779a0 ffffffff815b01d5 0000000000000000 ffff880803077a38
      [  601.350965] Call Trace:
      [  601.351203]  [<ffffffff815b0920>] ? bit_wait+0x60/0x60
      [  601.351444]  [<ffffffff815b01d5>] schedule+0x35/0x80
      [  601.351709]  [<ffffffff815b2dd2>] schedule_timeout+0x192/0x230
      [  601.351958]  [<ffffffff812d43f7>] ? blk_flush_plug_list+0xc7/0x220
      [  601.352208]  [<ffffffff810bd737>] ? ktime_get+0x37/0xa0
      [  601.352446]  [<ffffffff815b0920>] ? bit_wait+0x60/0x60
      [  601.352688]  [<ffffffff815af784>] io_schedule_timeout+0xa4/0x110
      [  601.352951]  [<ffffffff815b3a4e>] ? _raw_spin_unlock_irqrestore+0xe/0x10
      [  601.353196]  [<ffffffff815b093b>] bit_wait_io+0x1b/0x70
      [  601.353440]  [<ffffffff815b056d>] __wait_on_bit+0x5d/0x90
      [  601.353689]  [<ffffffff81127bd0>] wait_on_page_bit+0xc0/0xd0
      [  601.353958]  [<ffffffff81096db0>] ? autoremove_wake_function+0x40/0x40
      [  601.354200]  [<ffffffff81127cc4>] __filemap_fdatawait_range+0xe4/0x140
      [  601.354441]  [<ffffffff81127d34>] filemap_fdatawait_range+0x14/0x30
      [  601.354688]  [<ffffffff81129a9f>] filemap_write_and_wait_range+0x3f/0x70
      [  601.354932]  [<ffffffff811ced3b>] blkdev_fsync+0x1b/0x50
      [  601.355193]  [<ffffffff811c82d9>] vfs_fsync_range+0x49/0xa0
      [  601.355432]  [<ffffffff811cf45a>] blkdev_write_iter+0xca/0x100
      [  601.355679]  [<ffffffff81197b1a>] __vfs_write+0xaa/0xe0
      [  601.355925]  [<ffffffff81198379>] vfs_write+0xa9/0x1a0
      [  601.356164]  [<ffffffff811c59d8>] kernel_write+0x38/0x50
      The underlying device is a null_blk, with default parameters:
        queue_mode    = MQ
        submit_queues = 1
      Verification that nullb0 has something inflight:
      root@pserver8:~# cat /sys/block/nullb0/inflight
             0        1
      root@pserver8:~# find /sys/block/nullb0/mq/0/cpu* -name rq_list -print -exec cat {} \;
      CTX pending:
      During debug it became clear that stalled request is always inserted in
      the rq_list from the following path:
         save_stack_trace_tsk + 34
         blk_mq_insert_requests + 231
         blk_mq_flush_plug_list + 281
         blk_flush_plug_list + 199
         wait_on_page_bit + 192
         __filemap_fdatawait_range + 228
         filemap_fdatawait_range + 20
         filemap_write_and_wait_range + 63
         blkdev_fsync + 27
         vfs_fsync_range + 73
         blkdev_write_iter + 202
         __vfs_write + 170
         vfs_write + 169
         kernel_write + 56
      So blk_flush_plug_list() was called with from_schedule == true.
      If from_schedule is true, that means that finally blk_mq_insert_requests()
      offloads execution of __blk_mq_run_hw_queue() and uses kblockd workqueue,
      i.e. it calls kblockd_schedule_delayed_work_on().
      That means, that we race with another CPU, which is about to execute
      __blk_mq_run_hw_queue() work.
      Further debugging shows the following traces from different CPUs:
        CPU#0                                  CPU#1
        ----------------------------------     -------------------------------
        reqeust A inserted
        STORE hctx->ctx_map[0] bit marked
        kblockd_schedule...() returns 1
        <schedule to kblockd workqueue>
                                               request B inserted
                                               STORE hctx->ctx_map[1] bit marked
                                               kblockd_schedule...() returns 0
        *** WORK PENDING bit is cleared ***
        flush_busy_ctxs() is executed, but
        bit 1, set by CPU#1, is not observed
      As a result request B pended forever.
      This behaviour can be explained by speculative LOAD of hctx->ctx_map on
      CPU#0, which is reordered with clear of PENDING bit and executed _before_
      actual STORE of bit 1 on CPU#1.
      The proper fix is an explicit full barrier <mfence>, which guarantees
      that clear of PENDING bit is to be executed before all possible
      speculative LOADS or STORES inside actual work function.
      Signed-off-by: default avatarRoman Pen <roman.penyaev@profitbricks.com>
      Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
      Cc: Michael Wang <yun.wang@profitbricks.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: linux-block@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
  13. 15 Mar, 2016 1 commit
    • Peter Zijlstra's avatar
      tags: Fix DEFINE_PER_CPU expansions · 25528213
      Peter Zijlstra authored
      $ make tags
        GEN     tags
      ctags: Warning: drivers/acpi/processor_idle.c:64: null expansion of name pattern "\1"
      ctags: Warning: drivers/xen/events/events_2l.c:41: null expansion of name pattern "\1"
      ctags: Warning: kernel/locking/lockdep.c:151: null expansion of name pattern "\1"
      ctags: Warning: kernel/rcu/rcutorture.c:133: null expansion of name pattern "\1"
      ctags: Warning: kernel/rcu/rcutorture.c:135: null expansion of name pattern "\1"
      ctags: Warning: kernel/workqueue.c:323: null expansion of name pattern "\1"
      ctags: Warning: net/ipv4/syncookies.c:53: null expansion of name pattern "\1"
      ctags: Warning: net/ipv6/syncookies.c:44: null expansion of name pattern "\1"
      ctags: Warning: net/rds/page.c:45: null expansion of name pattern "\1"
      Which are all the result of the DEFINE_PER_CPU pattern:
        scripts/tags.sh:200:	'/\<DEFINE_PER_CPU([^,]*, *\([[:alnum:]_]*\)/\1/v/'
        scripts/tags.sh:201:	'/\<DEFINE_PER_CPU_SHARED_ALIGNED([^,]*, *\([[:alnum:]_]*\)/\1/v/'
      The below cures them. All except the workqueue one are within reasonable
      distance of the 80 char limit. TJ do you have any preference on how to
      fix the wq one, or shall we just not care its too long?
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: default avatarDavid S. Miller <davem@davemloft.net>
      Acked-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  14. 11 Mar, 2016 1 commit
  15. 02 Mar, 2016 1 commit
  16. 17 Feb, 2016 1 commit
  17. 10 Feb, 2016 1 commit
    • Tejun Heo's avatar
      workqueue: handle NUMA_NO_NODE for unbound pool_workqueue lookup · d6e022f1
      Tejun Heo authored
      When looking up the pool_workqueue to use for an unbound workqueue,
      workqueue assumes that the target CPU is always bound to a valid NUMA
      node.  However, currently, when a CPU goes offline, the mapping is
      destroyed and cpu_to_node() returns NUMA_NO_NODE.
      This has always been broken but hasn't triggered often enough before
      874bbfe6 ("workqueue: make sure delayed work run in local cpu").
      After the commit, workqueue forcifully assigns the local CPU for
      delayed work items without explicit target CPU to fix a different
      issue.  This widens the window where CPU can go offline while a
      delayed work item is pending causing delayed work items dispatched
      with target CPU set to an already offlined CPU.  The resulting
      NUMA_NO_NODE mapping makes workqueue try to queue the work item on a
      NULL pool_workqueue and thus crash.
      While 874bbfe6
       has been reverted for a different reason making the
      bug less visible again, it can still happen.  Fix it by mapping
      NUMA_NO_NODE to the default pool_workqueue from unbound_pwq_by_node().
      This is a temporary workaround.  The long term solution is keeping CPU
      -> NODE mapping stable across CPU off/online cycles which is being
      worked on.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarMike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Rafael J. Wysocki <rafael@kernel.org>
      Cc: Len Brown <len.brown@intel.com>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/g/1454424264.11183.46.camel@gmail.com
      Link: http://lkml.kernel.org/g/1453702100-2597-1-git-send-email-tangchen@cn.fujitsu.com
  18. 09 Feb, 2016 3 commits
    • Tejun Heo's avatar
      workqueue: implement "workqueue.debug_force_rr_cpu" debug feature · f303fccb
      Tejun Heo authored
      Workqueue used to guarantee local execution for work items queued
      without explicit target CPU.  The guarantee is gone now which can
      break some usages in subtle ways.  To flush out those cases, this
      patch implements a debug feature which forces round-robin CPU
      selection for all such work items.
      The debug feature defaults to off and can be enabled with a kernel
      parameter.  The default can be flipped with a debug config option.
      If you hit this commit during bisection, please refer to 041bd12e
      ("Revert "workqueue: make sure delayed work run in local cpu"") for
      more information and ping me.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
    • Mike Galbraith's avatar
      workqueue: schedule WORK_CPU_UNBOUND work on wq_unbound_cpumask CPUs · ef557180
      Mike Galbraith authored
      WORK_CPU_UNBOUND work items queued to a bound workqueue always run
      locally.  This is a good thing normally, but not when the user has
      asked us to keep unbound work away from certain CPUs.  Round robin
      these to wq_unbound_cpumask CPUs instead, as perturbation avoidance
      trumps performance.
      tj: Cosmetic and comment changes.  WARN_ON_ONCE() dropped from empty
          (wq_unbound_cpumask AND cpu_online_mask).  If we want that, it
          should be done when config changes.
      Signed-off-by: default avatarMike Galbraith <umgwanakikbuti@gmail.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
    • Tejun Heo's avatar
      Revert "workqueue: make sure delayed work run in local cpu" · 041bd12e
      Tejun Heo authored
      This reverts commit 874bbfe6.
      Workqueue used to implicity guarantee that work items queued without
      explicit CPU specified are put on the local CPU.  Recent changes in
      timer broke the guarantee and led to vmstat breakage which was fixed
      by 176bed1d ("vmstat: explicitly schedule per-cpu work on the CPU
      we need it to run on").
      vmstat is the most likely to expose the issue and it's quite possible
      that there are other similar problems which are a lot more difficult
      to trigger.  As a preventive measure, 874bbfe6 ("workqueue: make
      sure delayed work run in local cpu") was applied to restore the local
      CPU guarnatee.  Unfortunately, the change exposed a bug in timer code
      which got fixed by 22b886dd ("timers: Use proper base migration in
      add_timer_on()").  Due to code restructuring, the commit couldn't be
      backported beyond certain point and stable kernels which only had
      874bbfe6 started crashing.
      The local CPU guarantee was accidental more than anything else and we
      want to get rid of it anyway.  As, with the vmstat case fixed,
       is causing more problems than it's fixing, it has been
      decided to take the chance and officially break the guarantee by
      reverting the commit.  A debug feature will be added to force foreign
      CPU assignment to expose cases relying on the guarantee and fixes for
      the individual cases will be backported to stable as necessary.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Fixes: 874bbfe6 ("workqueue: make sure delayed work run in local cpu")
      Link: http://lkml.kernel.org/g/20160120211926.GJ10810@quack.suse.cz
      Cc: stable@vger.kernel.org
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
      Cc: Daniel Bilik <daniel.bilik@neosystem.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Daniel Bilik <daniel.bilik@neosystem.cz>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
  19. 29 Jan, 2016 1 commit
    • Tejun Heo's avatar
      workqueue: skip flush dependency checks for legacy workqueues · 23d11a58
      Tejun Heo authored
       ("workqueue: warn if memory reclaim tries to flush
      !WQ_MEM_RECLAIM workqueue") implemented flush dependency warning which
      triggers if a PF_MEMALLOC task or WQ_MEM_RECLAIM workqueue tries to
      flush a !WQ_MEM_RECLAIM workquee.
      This assumes that workqueues marked with WQ_MEM_RECLAIM sit in memory
      reclaim path and making it depend on something which may need more
      memory to make forward progress can lead to deadlocks.  Unfortunately,
      workqueues created with the legacy create*_workqueue() interface
      always have WQ_MEM_RECLAIM regardless of whether they are depended
      upon memory reclaim or not.  These spurious WQ_MEM_RECLAIM markings
      cause spurious triggering of the flush dependency checks.
        WARNING: CPU: 0 PID: 6 at kernel/workqueue.c:2361 check_flush_dependency+0x138/0x144()
        workqueue: WQ_MEM_RECLAIM deferwq:deferred_probe_work_func is flushing !WQ_MEM_RECLAIM events:lru_add_drain_per_cpu
        Workqueue: deferwq deferred_probe_work_func
        [<c0017acc>] (unwind_backtrace) from [<c0013134>] (show_stack+0x10/0x14)
        [<c0013134>] (show_stack) from [<c0245f18>] (dump_stack+0x94/0xd4)
        [<c0245f18>] (dump_stack) from [<c0026f9c>] (warn_slowpath_common+0x80/0xb0)
        [<c0026f9c>] (warn_slowpath_common) from [<c0026ffc>] (warn_slowpath_fmt+0x30/0x40)
        [<c0026ffc>] (warn_slowpath_fmt) from [<c00390b8>] (check_flush_dependency+0x138/0x144)
        [<c00390b8>] (check_flush_dependency) from [<c0039ca0>] (flush_work+0x50/0x15c)
        [<c0039ca0>] (flush_work) from [<c00c51b0>] (lru_add_drain_all+0x130/0x180)
        [<c00c51b0>] (lru_add_drain_all) from [<c00f728c>] (migrate_prep+0x8/0x10)
        [<c00f728c>] (migrate_prep) from [<c00bfbc4>] (alloc_contig_range+0xd8/0x338)
        [<c00bfbc4>] (alloc_contig_range) from [<c00f8f18>] (cma_alloc+0xe0/0x1ac)
        [<c00f8f18>] (cma_alloc) from [<c001cac4>] (__alloc_from_contiguous+0x38/0xd8)
        [<c001cac4>] (__alloc_from_contiguous) from [<c001ceb4>] (__dma_alloc+0x240/0x278)
        [<c001ceb4>] (__dma_alloc) from [<c001cf78>] (arm_dma_alloc+0x54/0x5c)
        [<c001cf78>] (arm_dma_alloc) from [<c0355ea4>] (dmam_alloc_coherent+0xc0/0xec)
        [<c0355ea4>] (dmam_alloc_coherent) from [<c039cc4c>] (ahci_port_start+0x150/0x1dc)
        [<c039cc4c>] (ahci_port_start) from [<c0384734>] (ata_host_start.part.3+0xc8/0x1c8)
        [<c0384734>] (ata_host_start.part.3) from [<c03898dc>] (ata_host_activate+0x50/0x148)
        [<c03898dc>] (ata_host_activate) from [<c039d558>] (ahci_host_activate+0x44/0x114)
        [<c039d558>] (ahci_host_activate) from [<c039f05c>] (ahci_platform_init_host+0x1d8/0x3c8)
        [<c039f05c>] (ahci_platform_init_host) from [<c039e6bc>] (tegra_ahci_probe+0x448/0x4e8)
        [<c039e6bc>] (tegra_ahci_probe) from [<c0347058>] (platform_drv_probe+0x50/0xac)
        [<c0347058>] (platform_drv_probe) from [<c03458cc>] (driver_probe_device+0x214/0x2c0)
        [<c03458cc>] (driver_probe_device) from [<c0343cc0>] (bus_for_each_drv+0x60/0x94)
        [<c0343cc0>] (bus_for_each_drv) from [<c03455d8>] (__device_attach+0xb0/0x114)
        [<c03455d8>] (__device_attach) from [<c0344ab8>] (bus_probe_device+0x84/0x8c)
        [<c0344ab8>] (bus_probe_device) from [<c0344f48>] (deferred_probe_work_func+0x68/0x98)
        [<c0344f48>] (deferred_probe_work_func) from [<c003b738>] (process_one_work+0x120/0x3f8)
        [<c003b738>] (process_one_work) from [<c003ba48>] (worker_thread+0x38/0x55c)
        [<c003ba48>] (worker_thread) from [<c0040f14>] (kthread+0xdc/0xf4)
        [<c0040f14>] (kthread) from [<c000f778>] (ret_from_fork+0x14/0x3c)
      Fix it by marking workqueues created via create*_workqueue() with
      __WQ_LEGACY and disabling flush dependency checks on them.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-and-tested-by: default avatarThierry Reding <thierry.reding@gmail.com>
      Link: http://lkml.kernel.org/g/20160126173843.GA11115@ulmo.nvidia.com
      Fixes: fca839c0 ("workqueue: warn if memory reclaim tries to flush !WQ_MEM_RECLAIM workqueue")
  20. 07 Jan, 2016 1 commit
  21. 08 Dec, 2015 2 commits
    • Tejun Heo's avatar
      workqueue: implement lockup detector · 82607adc
      Tejun Heo authored
      Workqueue stalls can happen from a variety of usage bugs such as
      missing WQ_MEM_RECLAIM flag or concurrency managed work item
      indefinitely staying RUNNING.  These stalls can be extremely difficult
      to hunt down because the usual warning mechanisms can't detect
      workqueue stalls and the internal state is pretty opaque.
      To alleviate the situation, this patch implements workqueue lockup
      detector.  It periodically monitors all worker_pools periodically and,
      if any pool failed to make forward progress longer than the threshold
      duration, triggers warning and dumps workqueue state as follows.
       BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 31s!
       Showing busy workqueues and worker pools:
       workqueue events: flags=0x0
         pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=17/256
           pending: monkey_wrench_fn, e1000_watchdog, cache_reap, vmstat_shepherd, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, cgroup_release_agent
       workqueue events_power_efficient: flags=0x80
         pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
           pending: check_lifetime, neigh_periodic_work
       workqueue cgroup_pidlist_destroy: flags=0x0
         pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
           pending: cgroup_pidlist_destroy_work_fn
      The detection mechanism is controller through kernel parameter
      workqueue.watchdog_thresh and can be updated at runtime through the
      sysfs module parameter file.
      v2: Decoupled from softlockup control knobs.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarDon Zickus <dzickus@redhat.com>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Chris Mason <clm@fb.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
    • Tejun Heo's avatar
      workqueue: warn if memory reclaim tries to flush !WQ_MEM_RECLAIM workqueue · fca839c0
      Tejun Heo authored
      Task or work item involved in memory reclaim trying to flush a
      non-WQ_MEM_RECLAIM workqueue or one of its work items can lead to
      deadlock.  Trigger WARN_ONCE() if such conditions are detected.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
  22. 12 Oct, 2015 1 commit
  23. 30 Sep, 2015 1 commit
    • Shaohua Li's avatar
      workqueue: make sure delayed work run in local cpu · 874bbfe6
      Shaohua Li authored
      My system keeps crashing with below message. vmstat_update() schedules a delayed
      work in current cpu and expects the work runs in the cpu.
      schedule_delayed_work() is expected to make delayed work run in local cpu. The
      problem is timer can be migrated with NO_HZ. __queue_work() queues work in
      timer handler, which could run in a different cpu other than where the delayed
      work is scheduled. The end result is the delayed work runs in different cpu.
      The patch makes __queue_delayed_work records local cpu earlier. Where the timer
      runs doesn't change where the work runs with the change.
      [   28.010131] ------------[ cut here ]------------
      [   28.010609] kernel BUG at ../mm/vmstat.c:1392!
      [   28.011099] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
      [   28.011860] Modules linked in:
      [   28.012245] CPU: 0 PID: 289 Comm: kworker/0:3 Tainted: G        W4.3.0-rc3+ #634
      [   28.013065] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153802- 04/01/2014
      [   28.014160] Workqueue: events vmstat_update
      [   28.014571] task: ffff880117682580 ti: ffff8800ba428000 task.ti: ffff8800ba428000
      [   28.015445] RIP: 0010:[<ffffffff8115f921>]  [<ffffffff8115f921>]vmstat_update+0x31/0x80
      [   28.016282] RSP: 0018:ffff8800ba42fd80  EFLAGS: 00010297
      [   28.016812] RAX: 0000000000000000 RBX: ffff88011a858dc0 RCX:0000000000000000
      [   28.017585] RDX: ffff880117682580 RSI: ffffffff81f14d8c RDI:ffffffff81f4df8d
      [   28.018366] RBP: ffff8800ba42fd90 R08: 0000000000000001 R09:0000000000000000
      [   28.019169] R10: 0000000000000000 R11: 0000000000000121 R12:ffff8800baa9f640
      [   28.019947] R13: ffff88011a81e340 R14: ffff88011a823700 R15:0000000000000000
      [   28.020071] FS:  0000000000000000(0000) GS:ffff88011a800000(0000)knlGS:0000000000000000
      [   28.020071] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [   28.020071] CR2: 00007ff6144b01d0 CR3: 00000000b8e93000 CR4:00000000000006f0
      [   28.020071] Stack:
      [   28.020071]  ffff88011a858dc0 ffff8800baa9f640 ffff8800ba42fe00ffffffff8106bd88
      [   28.020071]  ffffffff8106bd0b 0000000000000096 0000000000000000ffffffff82f9b1e8
      [   28.020071]  ffffffff829f0b10 0000000000000000 ffffffff81f18460ffff88011a81e340
      [   28.020071] Call Trace:
      [   28.020071]  [<ffffffff8106bd88>] process_one_work+0x1c8/0x540
      [   28.020071]  [<ffffffff8106bd0b>] ? process_one_work+0x14b/0x540
      [   28.020071]  [<ffffffff8106c214>] worker_thread+0x114/0x460
      [   28.020071]  [<ffffffff8106c100>] ? process_one_work+0x540/0x540
      [   28.020071]  [<ffffffff81071bf8>] kthread+0xf8/0x110
      [   28.020071]  [<ffffffff81071b00>] ?kthread_create_on_node+0x200/0x200
      [   28.020071]  [<ffffffff81a6522f>] ret_from_fork+0x3f/0x70
      [   28.020071]  [<ffffffff81071b00>] ?kthread_create_on_node+0x200/0x200
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org # v2.6.31+
  24. 12 Aug, 2015 1 commit
    • Peter Zijlstra's avatar
      sched: Fix a race between __kthread_bind() and sched_setaffinity() · 25834c73
      Peter Zijlstra authored
      Because sched_setscheduler() checks p->flags & PF_NO_SETAFFINITY
      without locks, a caller might observe an old value and race with the
      set_cpus_allowed_ptr() call from __kthread_bind() and effectively undo
      						    if (p->flags & PF_NO_SETAFFINITIY)
      	  p->flags |= PF_NO_SETAFFINITY
      Fix the bug by putting everything under the regular scheduler locks.
      This also closes a hole in the serialization of task_struct::{nr_,}cpus_allowed.
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dedekind1@gmail.com
      Cc: juri.lelli@arm.com
      Cc: mgorman@suse.de
      Cc: riel@redhat.com
      Cc: rostedt@goodmis.org
      Link: http://lkml.kernel.org/r/20150515154833.545640346@infradead.org
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
  25. 04 Aug, 2015 1 commit
  26. 22 Jul, 2015 1 commit
  27. 29 May, 2015 1 commit
  28. 28 May, 2015 1 commit
  29. 21 May, 2015 3 commits
  30. 19 May, 2015 2 commits
    • Lai Jiangshan's avatar
      workqueue: ensure attrs changes are properly synchronized · d4d3e257
      Lai Jiangshan authored
      Current modification to attrs via sysfs is not fully synchronized.
      Process A (change cpumask)      | Process B (change numa affinity)
      wq_cpumask_store()              |
        wq_sysfs_prep_attrs()         |
                                      | apply_workqueue_attrs()
        apply_workqueue_attrs()       |
      It results that the Process B's operation is totally reverted
      without any notification, it is a buggy behavior.  So this patch
      moves wq_sysfs_prep_attrs() into the protection under wq_pool_mutex
      to ensure attrs changes are properly synchronized.
      Signed-off-by: default avatarLai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
    • Lai Jiangshan's avatar
      workqueue: separate out and refactor the locking of applying attrs · a0111cf6
      Lai Jiangshan authored
      Applying attrs requires two locks: get_online_cpus() and wq_pool_mutex,
      and this code is duplicated at two places (apply_workqueue_attrs() and
      workqueue_set_unbound_cpumask()).  So we separate out this locking
      code into apply_wqattrs_[un]lock() and do a minor refactor on
      The apply_wqattrs_[un]lock() will be also used on later patch for
      ensuring attrs changes are properly synchronized.
      tj: minor updates to comments
      Signed-off-by: default avatarLai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
  31. 18 May, 2015 2 commits
    • Lai Jiangshan's avatar
      workqueue: simplify wq_update_unbound_numa() · f7142ed4
      Lai Jiangshan authored
      wq_update_unbound_numa() is known be called with wq_pool_mutex held.
      But wq_update_unbound_numa() requests wq->mutex before reading
      wq->unbound_attrs, wq->numa_pwq_tbl[] and wq->dfl_pwq.  But these fields
      were changed to be allowed being read with wq_pool_mutex held.  So we
      simply remove the mutex_lock(&wq->mutex).
      Without the dependence on the the mutex_lock(&wq->mutex), the test
      of wq->unbound_attrs->no_numa can also be moved upward.
      The old code need a long comment to describe the stableness of
      @wq->unbound_attrs which is also guaranteed by wq_pool_mutex now,
      so we don't need this such comment.
      Signed-off-by: default avatarLai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
    • Lai Jiangshan's avatar
      workqueue: wq_pool_mutex protects the attrs-installation · 5b95e1af
      Lai Jiangshan authored
      Current wq_pool_mutex doesn't proctect the attrs-installation, it results
      that ->unbound_attrs, ->numa_pwq_tbl[] and ->dfl_pwq can only be accessed
      under wq->mutex and causes some inconveniences. Example, wq_update_unbound_numa()
      has to acquire wq->mutex before fetching the wq->unbound_attrs->no_numa
      and the old_pwq.
      attrs-installation is a short operation, so this change will no cause any
      latency for other operations which also acquire the wq_pool_mutex.
      The only unprotected attrs-installation code is in apply_workqueue_attrs(),
      so this patch touches code less than comments.
      It is also a preparation patch for next several patches which read
      wq->unbound_attrs, wq->numa_pwq_tbl[] and wq->dfl_pwq with
      only wq_pool_mutex held.
      Signed-off-by: default avatarLai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>