1. 17 Aug, 2017 1 commit
    • Boqun Feng's avatar
      locking/lockdep: Explicitly initialize wq_barrier::done::map · 52fa5bc5
      Boqun Feng authored
      
      
      With the new lockdep crossrelease feature, which checks completions usage,
      a false positive is reported in the workqueue code:
      
      > Worker A : acquired of wfc.work -> wait for cpu_hotplug_lock to be released
      > Task   B : acquired of cpu_hotplug_lock -> wait for lock#3 to be released
      > Task   C : acquired of lock#3 -> wait for completion of barr->done
      > (Task C is in lru_add_drain_all_cpuslocked())
      > Worker D : wait for wfc.work to be released -> will complete barr->done
      
      Such a dead lock can not happen because Task C's barr->done and Worker D's
      barr->done can not be the same instance.
      
      The reason of this false positive is we initialize all wq_barrier::done
      at insert_wq_barrier() via init_completion(), which makes them belong to
      the same lock class, therefore, impossible circles are reported.
      
      To fix this, explicitly initialize the lockdep map for wq_barrier::done
      in insert_wq_barrier(), so that the lock class key of wq_barrier::done
      is a subkey of the corresponding work_struct, as a result we won't build
      a dependency between a wq_barrier with a unrelated work, and we can
      differ wq barriers based on the related works, so the false positive
      above is avoided.
      
      Also define the empty lockdep_init_map_crosslock() for !CROSSRELEASE
      to make the code simple and away from unnecessary #ifdefs.
      Reported-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarBoqun Feng <boqun.feng@gmail.com>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20170817094622.12915-1-boqun.feng@gmail.com
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      52fa5bc5
  2. 10 Aug, 2017 1 commit
    • Byungchul Park's avatar
      locking/lockdep: Implement the 'crossrelease' feature · b09be676
      Byungchul Park authored
      
      
      Lockdep is a runtime locking correctness validator that detects and
      reports a deadlock or its possibility by checking dependencies between
      locks. It's useful since it does not report just an actual deadlock but
      also the possibility of a deadlock that has not actually happened yet.
      That enables problems to be fixed before they affect real systems.
      
      However, this facility is only applicable to typical locks, such as
      spinlocks and mutexes, which are normally released within the context in
      which they were acquired. However, synchronization primitives like page
      locks or completions, which are allowed to be released in any context,
      also create dependencies and can cause a deadlock.
      
      So lockdep should track these locks to do a better job. The 'crossrelease'
      implementation makes these primitives also be tracked.
      Signed-off-by: default avatarByungchul Park <byungchul.park@lge.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: boqun.feng@gmail.com
      Cc: kernel-team@lge.com
      Cc: kirill@shutemov.name
      Cc: npiggin@gmail.com
      Cc: walken@google.com
      Cc: willy@infradead.org
      Link: http://lkml.kernel.org/r/1502089981-21272-6-git-send-email-byungchul.park@lge.com
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      b09be676
  3. 28 Jul, 2017 1 commit
    • Michael Bringmann's avatar
      workqueue: Work around edge cases for calc of pool's cpumask · 1ad0f0a7
      Michael Bringmann authored
      
      
      There is an underlying assumption/trade-off in many layers of the Linux
      system that CPU <-> node mapping is static.  This is despite the presence
      of features like NUMA and 'hotplug' that support the dynamic addition/
      removal of fundamental system resources like CPUs and memory.  PowerPC
      systems, however, do provide extensive features for the dynamic change
      of resources available to a system.
      
      Currently, there is little or no synchronization protection around the
      updating of the CPU <-> node mapping, and the export/update of this
      information for other layers / modules.  In systems which can change
      this mapping during 'hotplug', like PowerPC, the information is changing
      underneath all layers that might reference it.
      
      This patch attempts to ensure that a valid, usable cpumask attribute
      is used by the workqueue infrastructure when setting up new resource
      pools.  It prevents a crash that has been observed when an 'empty'
      cpumask is passed along to the worker/task scheduling code.  It is
      intended as a temporary workaround until a more fundamental review and
      correction of the issue can be done.
      
      [With additions to the patch provided by Tejun Hao <tj@kernel.org>]
      Signed-off-by: default avatarMichael Bringmann <mwb@linux.vnet.ibm.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      1ad0f0a7
  4. 25 Jul, 2017 1 commit
    • Tejun Heo's avatar
      workqueue: implicit ordered attribute should be overridable · 0a94efb5
      Tejun Heo authored
      5c0338c6
      
       ("workqueue: restore WQ_UNBOUND/max_active==1 to be
      ordered") automatically enabled ordered attribute for unbound
      workqueues w/ max_active == 1.  Because ordered workqueues reject
      max_active and some attribute changes, this implicit ordered mode
      broke cases where the user creates an unbound workqueue w/ max_active
      == 1 and later explicitly changes the related attributes.
      
      This patch distinguishes explicit and implicit ordered setting and
      overrides from attribute changes if implict.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Fixes: 5c0338c6 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered")
      0a94efb5
  5. 19 Jul, 2017 1 commit
    • Tejun Heo's avatar
      workqueue: restore WQ_UNBOUND/max_active==1 to be ordered · 5c0338c6
      Tejun Heo authored
      The combination of WQ_UNBOUND and max_active == 1 used to imply
      ordered execution.  After NUMA affinity 4c16bd32
      
       ("workqueue:
      implement NUMA affinity for unbound workqueues"), this is no longer
      true due to per-node worker pools.
      
      While the right way to create an ordered workqueue is
      alloc_ordered_workqueue(), the documentation has been misleading for a
      long time and people do use WQ_UNBOUND and max_active == 1 for ordered
      workqueues which can lead to subtle bugs which are very difficult to
      trigger.
      
      It's unlikely that we'd see noticeable performance impact by enforcing
      ordering on WQ_UNBOUND / max_active == 1 workqueues.  Let's
      automatically set __WQ_ORDERED for those workqueues.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarChristoph Hellwig <hch@infradead.org>
      Reported-by: default avatarAlexei Potashnik <alexei@purestorage.com>
      Fixes: 4c16bd32 ("workqueue: implement NUMA affinity for unbound workqueues")
      Cc: stable@vger.kernel.org # v3.10+
      5c0338c6
  6. 20 Jun, 2017 1 commit
    • Ingo Molnar's avatar
      sched/wait: Rename wait_queue_t => wait_queue_entry_t · ac6424b9
      Ingo Molnar authored
      
      
      Rename:
      
      	wait_queue_t		=>	wait_queue_entry_t
      
      'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
      but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
      which had to carry the name.
      
      Start sorting this out by renaming it to 'wait_queue_entry_t'.
      
      This also allows the real structure name 'struct __wait_queue' to
      lose its double underscore and become 'struct wait_queue_entry',
      which is the more canonical nomenclature for such data types.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      ac6424b9
  7. 15 Apr, 2017 1 commit
    • Thomas Gleixner's avatar
      workqueue: Provide work_on_cpu_safe() · 0e8d6a93
      Thomas Gleixner authored
      
      
      work_on_cpu() is not protected against CPU hotplug. For code which requires
      to be either executed on an online CPU or to fail if the CPU is not
      available the callsite would have to protect against CPU hotplug.
      
      Provide a function which does get/put_online_cpus() around the call to
      work_on_cpu() and fails the call with -ENODEV if the target CPU is not
      online.
      
      Preparatory patch to convert several racy task affinity manipulations.
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Sebastian Siewior <bigeasy@linutronix.de>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Len Brown <lenb@kernel.org>
      Link: http://lkml.kernel.org/r/20170412201042.262610721@linutronix.de
      
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      0e8d6a93
  8. 06 Mar, 2017 2 commits
  9. 10 Feb, 2017 1 commit
    • Kees Cook's avatar
      time: Remove CONFIG_TIMER_STATS · dfb4357d
      Kees Cook authored
      
      
      Currently CONFIG_TIMER_STATS exposes process information across namespaces:
      
      kernel/time/timer_list.c print_timer():
      
              SEQ_printf(m, ", %s/%d", tmp, timer->start_pid);
      
      /proc/timer_list:
      
       #11: <0000000000000000>, hrtimer_wakeup, S:01, do_nanosleep, cron/2570
      
      Given that the tracer can give the same information, this patch entirely
      removes CONFIG_TIMER_STATS.
      Suggested-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Acked-by: default avatarJohn Stultz <john.stultz@linaro.org>
      Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
      Cc: linux-doc@vger.kernel.org
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Xing Gao <xgao01@email.wm.edu>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Jessica Frazelle <me@jessfraz.com>
      Cc: kernel-hardening@lists.openwall.com
      Cc: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Michal Marek <mmarek@suse.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Olof Johansson <olof@lixom.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: linux-api@vger.kernel.org
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Link: http://lkml.kernel.org/r/20170208192659.GA32582@beast
      
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      dfb4357d
  10. 19 Oct, 2016 1 commit
    • Tejun Heo's avatar
      workqueue: move wq_numa_init() to workqueue_init() · 2186d9f9
      Tejun Heo authored
      
      
      While splitting up workqueue initialization into two parts,
      ac8f73400782 ("workqueue: make workqueue available early during boot")
      put wq_numa_init() into workqueue_init_early().  Unfortunately, on
      some archs including power and arm64, cpu to node mapping isn't yet
      established by the time the early init is called leading to incorrect
      NUMA initialization and subsequently the following oops due to zero
      cpumask on node-specific unbound pools.
      
        Unable to handle kernel paging request for data at address 0x00000038
        Faulting instruction address: 0xc0000000000fc0cc
        Oops: Kernel access of bad area, sig: 11 [#1]
        SMP NR_CPUS=2048 NUMA PowerNV
        Modules linked in:
        CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.8.0-compiler_gcc-6.2.0-next-20161005 #94
        task: c0000007f5400000 task.stack: c000001ffc084000
        NIP: c0000000000fc0cc LR: c0000000000ed928 CTR: c0000000000fbfd0
        REGS: c000001ffc087780 TRAP: 0300   Not tainted  (4.8.0-compiler_gcc-6.2.0-next-20161005)
        MSR: 9000000002009033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE>  CR: 48000424  XER: 00000000
        CFAR: c0000000000089dc DAR: 0000000000000038 DSISR: 40000000 SOFTE: 0
        GPR00: c0000000000ed928 c000001ffc087a00 c000000000e63200 c000000010d6d600
        GPR04: c0000007f5409200 0000000000000021 000000000748e08c 000000000000001f
        GPR08: 0000000000000000 0000000000000021 000000000748f1f8 0000000000000000
        GPR12: 0000000028000422 c00000000fb80000 c00000000000e0c8 0000000000000000
        GPR16: 0000000000000000 0000000000000000 0000000000000021 0000000000000001
        GPR20: ffffffffafb50401 0000000000000000 c000000010d6d600 000000000000ba7e
        GPR24: 000000000000ba7e c000000000d8bc58 afb504000afb5041 0000000000000001
        GPR28: 0000000000000000 0000000000000004 c0000007f5409280 0000000000000000
        NIP [c0000000000fc0cc] enqueue_task_fair+0xfc/0x18b0
        LR [c0000000000ed928] activate_task+0x78/0xe0
        Call Trace:
        [c000001ffc087a00] [c0000007f5409200] 0xc0000007f5409200 (unreliable)
        [c000001ffc087b10] [c0000000000ed928] activate_task+0x78/0xe0
        [c000001ffc087b50] [c0000000000ede58] ttwu_do_activate+0x68/0xc0
        [c000001ffc087b90] [c0000000000ef1b8] try_to_wake_up+0x208/0x4f0
        [c000001ffc087c10] [c0000000000d3484] create_worker+0x144/0x250
        [c000001ffc087cb0] [c000000000cd72d0] workqueue_init+0x124/0x150
        [c000001ffc087d00] [c000000000cc0e74] kernel_init_freeable+0x158/0x360
        [c000001ffc087dc0] [c00000000000e0e4] kernel_init+0x24/0x160
        [c000001ffc087e30] [c00000000000bfa0] ret_from_kernel_thread+0x5c/0xbc
        Instruction dump:
        62940401 3b800000 3aa00000 7f17c378 3a600001 3b600001 60000000 60000000
        60420000 72490021 ebfe0150 2f890001 <ebbf0038> 419e0de0 7fbee840 419e0e58
        ---[ end trace 0000000000000000 ]---
      
      Fix it by moving wq_numa_init() to workqueue_init().  As this means
      that the early intialization may not have full NUMA info for per-cpu
      pools and ignores NUMA affinity for unbound pools, fix them up from
      workqueue_init() after wq_numa_init().
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: http://lkml.kernel.org/r/87twck5wqo.fsf@concordia.ellerman.id.au
      
      
      Fixes: ac8f73400782 ("workqueue: make workqueue available early during boot")
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      2186d9f9
  11. 11 Oct, 2016 1 commit
  12. 17 Sep, 2016 2 commits
    • Tejun Heo's avatar
      workqueue: remove keventd_up() · 863b710b
      Tejun Heo authored
      
      
      keventd_up() no longer has in-kernel users.  Remove it and make
      wq_online static.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      863b710b
    • Tejun Heo's avatar
      workqueue: make workqueue available early during boot · 3347fa09
      Tejun Heo authored
      
      
      Workqueue is currently initialized in an early init call; however,
      there are cases where early boot code has to be split and reordered to
      come after workqueue initialization or the same code path which makes
      use of workqueues is used both before workqueue initailization and
      after.  The latter cases have to gate workqueue usages with
      keventd_up() tests, which is nasty and easy to get wrong.
      
      Workqueue usages have become widespread and it'd be a lot more
      convenient if it can be used very early from boot.  This patch splits
      workqueue initialization into two steps.  workqueue_init_early() which
      sets up the basic data structures so that workqueues can be created
      and work items queued, and workqueue_init() which actually brings up
      workqueues online and starts executing queued work items.  The former
      step can be done very early during boot once memory allocation,
      cpumasks and idr are initialized.  The latter right after kthreads
      become available.
      
      This allows work item queueing and canceling from very early boot
      which is what most of these use cases want.
      
      * As systemd_wq being initialized doesn't indicate that workqueue is
        fully online anymore, update keventd_up() to test wq_online instead.
        The follow-up patches will get rid of all its usages and the
        function itself.
      
      * Flushing doesn't make sense before workqueue is fully initialized.
        The flush functions trigger WARN and return immediately before fully
        online.
      
      * Work items are never in-flight before fully online.  Canceling can
        always succeed by skipping the flush step.
      
      * Some code paths can no longer assume to be called with irq enabled
        as irq is disabled during early boot.  Use irqsave/restore
        operations instead.
      
      v2: Watchdog init, which requires timer to be running, moved from
          workqueue_init_early() to workqueue_init().
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Suggested-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/CA+55aFx0vPuMuxn00rBSM192n-Du5uxy+4AvKa0SBSOVJeuCGg@mail.gmail.com
      3347fa09
  13. 16 Sep, 2016 1 commit
  14. 29 Aug, 2016 1 commit
  15. 14 Jul, 2016 1 commit
  16. 01 Jul, 2016 1 commit
  17. 16 Jun, 2016 1 commit
    • Peter Zijlstra's avatar
      workqueue: Fix setting affinity of unbound worker threads · d945b5e9
      Peter Zijlstra authored
      With commit e9d867a6
      
       ("sched: Allow per-cpu kernel threads to
      run on online && !active"), __set_cpus_allowed_ptr() expects that only
      strict per-cpu kernel threads can have affinity to an online CPU which
      is not yet active.
      
      This assumption is currently broken in the CPU_ONLINE notification
      handler for the workqueues where restore_unbound_workers_cpumask()
      calls set_cpus_allowed_ptr() when the first cpu in the unbound
      worker's pool->attr->cpumask comes online. Since
      set_cpus_allowed_ptr() is called with pool->attr->cpumask in which
      only one CPU is online which is not yet active, we get the following
      WARN_ON during an CPU online operation.
      
      ------------[ cut here ]------------
      WARNING: CPU: 40 PID: 248 at kernel/sched/core.c:1166
      __set_cpus_allowed_ptr+0x228/0x2e0
      Modules linked in:
      CPU: 40 PID: 248 Comm: cpuhp/40 Not tainted 4.6.0-autotest+ #4
      <..snip..>
      Call Trace:
      [c000000f273ff920] [c00000000010493c] __set_cpus_allowed_ptr+0x2cc/0x2e0 (unreliable)
      [c000000f273ffac0] [c0000000000ed4b0] workqueue_cpu_up_callback+0x2c0/0x470
      [c000000f273ffb70] [c0000000000f5c58] notifier_call_chain+0x98/0x100
      [c000000f273ffbc0] [c0000000000c5ed0] __cpu_notify+0x70/0xe0
      [c000000f273ffc00] [c0000000000c6028] notify_online+0x38/0x50
      [c000000f273ffc30] [c0000000000c5214] cpuhp_invoke_callback+0x84/0x250
      [c000000f273ffc90] [c0000000000c562c] cpuhp_up_callbacks+0x5c/0x120
      [c000000f273ffce0] [c0000000000c64d4] cpuhp_thread_fun+0x184/0x1c0
      [c000000f273ffd20] [c0000000000fa050] smpboot_thread_fn+0x290/0x2a0
      [c000000f273ffd80] [c0000000000f45b0] kthread+0x110/0x130
      [c000000f273ffe30] [c000000000009570] ret_from_kernel_thread+0x5c/0x6c
      ---[ end trace 00f1456578b2a3b2 ]---
      
      This patch fixes this by limiting the mask to the intersection of
      the pool affinity and online CPUs.
      
      Changelog-cribbed-from: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Reported-by: default avatarAbdul Haleem <abdhalee@linux.vnet.ibm.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      d945b5e9
  18. 20 May, 2016 2 commits
  19. 12 May, 2016 1 commit
    • Wanpeng Li's avatar
      workqueue: fix rebind bound workers warning · f7c17d26
      Wanpeng Li authored
      
      
      ------------[ cut here ]------------
      WARNING: CPU: 0 PID: 16 at kernel/workqueue.c:4559 rebind_workers+0x1c0/0x1d0
      Modules linked in:
      CPU: 0 PID: 16 Comm: cpuhp/0 Not tainted 4.6.0-rc4+ #31
      Hardware name: IBM IBM System x3550 M4 Server -[7914IUW]-/00Y8603, BIOS -[D7E128FUS-1.40]- 07/23/2013
       0000000000000000 ffff881037babb58 ffffffff8139d885 0000000000000010
       0000000000000000 0000000000000000 0000000000000000 ffff881037babba8
       ffffffff8108505d ffff881037ba0000 000011cf3e7d6e60 0000000000000046
      Call Trace:
       dump_stack+0x89/0xd4
       __warn+0xfd/0x120
       warn_slowpath_null+0x1d/0x20
       rebind_workers+0x1c0/0x1d0
       workqueue_cpu_up_callback+0xf5/0x1d0
       notifier_call_chain+0x64/0x90
       ? trace_hardirqs_on_caller+0xf2/0x220
       ? notify_prepare+0x80/0x80
       __raw_notifier_call_chain+0xe/0x10
       __cpu_notify+0x35/0x50
       notify_down_prepare+0x5e/0x80
       ? notify_prepare+0x80/0x80
       cpuhp_invoke_callback+0x73/0x330
       ? __schedule+0x33e/0x8a0
       cpuhp_down_callbacks+0x51/0xc0
       cpuhp_thread_fun+0xc1/0xf0
       smpboot_thread_fn+0x159/0x2a0
       ? smpboot_create_threads+0x80/0x80
       kthread+0xef/0x110
       ? wait_for_completion+0xf0/0x120
       ? schedule_tail+0x35/0xf0
       ret_from_fork+0x22/0x50
       ? __init_kthread_worker+0x70/0x70
      ---[ end trace eb12ae47d2382d8f ]---
      notify_down_prepare: attempt to take down CPU 0 failed
      
      This bug can be reproduced by below config w/ nohz_full= all cpus:
      
      CONFIG_BOOTPARAM_HOTPLUG_CPU0=y
      CONFIG_DEBUG_HOTPLUG_CPU0=y
      CONFIG_NO_HZ_FULL=y
      
      As Thomas pointed out:
      
      | If a down prepare callback fails, then DOWN_FAILED is invoked for all
      | callbacks which have successfully executed DOWN_PREPARE.
      |
      | But, workqueue has actually two notifiers. One which handles
      | UP/DOWN_FAILED/ONLINE and one which handles DOWN_PREPARE.
      |
      | Now look at the priorities of those callbacks:
      |
      | CPU_PRI_WORKQUEUE_UP        = 5
      | CPU_PRI_WORKQUEUE_DOWN      = -5
      |
      | So the call order on DOWN_PREPARE is:
      |
      | CB 1
      | CB ...
      | CB workqueue_up() -> Ignores DOWN_PREPARE
      | CB ...
      | CB X ---> Fails
      |
      | So we call up to CB X with DOWN_FAILED
      |
      | CB 1
      | CB ...
      | CB workqueue_up() -> Handles DOWN_FAILED
      | CB ...
      | CB X-1
      |
      | So the problem is that the workqueue stuff handles DOWN_FAILED in the up
      | callback, while it should do it in the down callback. Which is not a good idea
      | either because it wants to be called early on rollback...
      |
      | Brilliant stuff, isn't it? The hotplug rework will solve this problem because
      | the callbacks become symetric, but for the existing mess, we need some
      | workaround in the workqueue code.
      
      The boot CPU handles housekeeping duty(unbound timers, workqueues,
      timekeeping, ...) on behalf of full dynticks CPUs. It must remain
      online when nohz full is enabled. There is a priority set to every
      notifier_blocks:
      
      workqueue_cpu_up > tick_nohz_cpu_down > workqueue_cpu_down
      
      So tick_nohz_cpu_down callback failed when down prepare cpu 0, and
      notifier_blocks behind tick_nohz_cpu_down will not be called any
      more, which leads to workers are actually not unbound. Then hotplug
      state machine will fallback to undo and online cpu 0 again. Workers
      will be rebound unconditionally even if they are not unbound and
      trigger the warning in this progress.
      
      This patch fix it by catching !DISASSOCIATED to avoid rebind bound
      workers.
      
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Frédéric Weisbecker <fweisbec@gmail.com>
      Cc: stable@vger.kernel.org
      Suggested-by: default avatarLai Jiangshan <jiangshanlai@gmail.com>
      Signed-off-by: default avatarWanpeng Li <wanpeng.li@hotmail.com>
      f7c17d26
  20. 26 Apr, 2016 1 commit
    • Roman Pen's avatar
      workqueue: fix ghost PENDING flag while doing MQ IO · 346c09f8
      Roman Pen authored
      
      
      The bug in a workqueue leads to a stalled IO request in MQ ctx->rq_list
      with the following backtrace:
      
      [  601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds.
      [  601.347574]       Tainted: G           O    4.4.5-1-storage+ #6
      [  601.347651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  601.348142] kworker/u129:5  D ffff880803077988     0  1636      2 0x00000000
      [  601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server]
      [  601.348999]  ffff880803077988 ffff88080466b900 ffff8808033f9c80 ffff880803078000
      [  601.349662]  ffff880807c95000 7fffffffffffffff ffffffff815b0920 ffff880803077ad0
      [  601.350333]  ffff8808030779a0 ffffffff815b01d5 0000000000000000 ffff880803077a38
      [  601.350965] Call Trace:
      [  601.351203]  [<ffffffff815b0920>] ? bit_wait+0x60/0x60
      [  601.351444]  [<ffffffff815b01d5>] schedule+0x35/0x80
      [  601.351709]  [<ffffffff815b2dd2>] schedule_timeout+0x192/0x230
      [  601.351958]  [<ffffffff812d43f7>] ? blk_flush_plug_list+0xc7/0x220
      [  601.352208]  [<ffffffff810bd737>] ? ktime_get+0x37/0xa0
      [  601.352446]  [<ffffffff815b0920>] ? bit_wait+0x60/0x60
      [  601.352688]  [<ffffffff815af784>] io_schedule_timeout+0xa4/0x110
      [  601.352951]  [<ffffffff815b3a4e>] ? _raw_spin_unlock_irqrestore+0xe/0x10
      [  601.353196]  [<ffffffff815b093b>] bit_wait_io+0x1b/0x70
      [  601.353440]  [<ffffffff815b056d>] __wait_on_bit+0x5d/0x90
      [  601.353689]  [<ffffffff81127bd0>] wait_on_page_bit+0xc0/0xd0
      [  601.353958]  [<ffffffff81096db0>] ? autoremove_wake_function+0x40/0x40
      [  601.354200]  [<ffffffff81127cc4>] __filemap_fdatawait_range+0xe4/0x140
      [  601.354441]  [<ffffffff81127d34>] filemap_fdatawait_range+0x14/0x30
      [  601.354688]  [<ffffffff81129a9f>] filemap_write_and_wait_range+0x3f/0x70
      [  601.354932]  [<ffffffff811ced3b>] blkdev_fsync+0x1b/0x50
      [  601.355193]  [<ffffffff811c82d9>] vfs_fsync_range+0x49/0xa0
      [  601.355432]  [<ffffffff811cf45a>] blkdev_write_iter+0xca/0x100
      [  601.355679]  [<ffffffff81197b1a>] __vfs_write+0xaa/0xe0
      [  601.355925]  [<ffffffff81198379>] vfs_write+0xa9/0x1a0
      [  601.356164]  [<ffffffff811c59d8>] kernel_write+0x38/0x50
      
      The underlying device is a null_blk, with default parameters:
      
        queue_mode    = MQ
        submit_queues = 1
      
      Verification that nullb0 has something inflight:
      
      root@pserver8:~# cat /sys/block/nullb0/inflight
             0        1
      root@pserver8:~# find /sys/block/nullb0/mq/0/cpu* -name rq_list -print -exec cat {} \;
      ...
      /sys/block/nullb0/mq/0/cpu2/rq_list
      CTX pending:
              ffff8838038e2400
      ...
      
      During debug it became clear that stalled request is always inserted in
      the rq_list from the following path:
      
         save_stack_trace_tsk + 34
         blk_mq_insert_requests + 231
         blk_mq_flush_plug_list + 281
         blk_flush_plug_list + 199
         wait_on_page_bit + 192
         __filemap_fdatawait_range + 228
         filemap_fdatawait_range + 20
         filemap_write_and_wait_range + 63
         blkdev_fsync + 27
         vfs_fsync_range + 73
         blkdev_write_iter + 202
         __vfs_write + 170
         vfs_write + 169
         kernel_write + 56
      
      So blk_flush_plug_list() was called with from_schedule == true.
      
      If from_schedule is true, that means that finally blk_mq_insert_requests()
      offloads execution of __blk_mq_run_hw_queue() and uses kblockd workqueue,
      i.e. it calls kblockd_schedule_delayed_work_on().
      
      That means, that we race with another CPU, which is about to execute
      __blk_mq_run_hw_queue() work.
      
      Further debugging shows the following traces from different CPUs:
      
        CPU#0                                  CPU#1
        ----------------------------------     -------------------------------
        reqeust A inserted
        STORE hctx->ctx_map[0] bit marked
        kblockd_schedule...() returns 1
        <schedule to kblockd workqueue>
                                               request B inserted
                                               STORE hctx->ctx_map[1] bit marked
                                               kblockd_schedule...() returns 0
        *** WORK PENDING bit is cleared ***
        flush_busy_ctxs() is executed, but
        bit 1, set by CPU#1, is not observed
      
      As a result request B pended forever.
      
      This behaviour can be explained by speculative LOAD of hctx->ctx_map on
      CPU#0, which is reordered with clear of PENDING bit and executed _before_
      actual STORE of bit 1 on CPU#1.
      
      The proper fix is an explicit full barrier <mfence>, which guarantees
      that clear of PENDING bit is to be executed before all possible
      speculative LOADS or STORES inside actual work function.
      Signed-off-by: default avatarRoman Pen <roman.penyaev@profitbricks.com>
      Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
      Cc: Michael Wang <yun.wang@profitbricks.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: linux-block@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      346c09f8
  21. 15 Mar, 2016 1 commit
    • Peter Zijlstra's avatar
      tags: Fix DEFINE_PER_CPU expansions · 25528213
      Peter Zijlstra authored
      
      
      $ make tags
        GEN     tags
      ctags: Warning: drivers/acpi/processor_idle.c:64: null expansion of name pattern "\1"
      ctags: Warning: drivers/xen/events/events_2l.c:41: null expansion of name pattern "\1"
      ctags: Warning: kernel/locking/lockdep.c:151: null expansion of name pattern "\1"
      ctags: Warning: kernel/rcu/rcutorture.c:133: null expansion of name pattern "\1"
      ctags: Warning: kernel/rcu/rcutorture.c:135: null expansion of name pattern "\1"
      ctags: Warning: kernel/workqueue.c:323: null expansion of name pattern "\1"
      ctags: Warning: net/ipv4/syncookies.c:53: null expansion of name pattern "\1"
      ctags: Warning: net/ipv6/syncookies.c:44: null expansion of name pattern "\1"
      ctags: Warning: net/rds/page.c:45: null expansion of name pattern "\1"
      
      Which are all the result of the DEFINE_PER_CPU pattern:
      
        scripts/tags.sh:200:	'/\<DEFINE_PER_CPU([^,]*, *\([[:alnum:]_]*\)/\1/v/'
        scripts/tags.sh:201:	'/\<DEFINE_PER_CPU_SHARED_ALIGNED([^,]*, *\([[:alnum:]_]*\)/\1/v/'
      
      The below cures them. All except the workqueue one are within reasonable
      distance of the 80 char limit. TJ do you have any preference on how to
      fix the wq one, or shall we just not care its too long?
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: default avatarDavid S. Miller <davem@davemloft.net>
      Acked-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      25528213
  22. 11 Mar, 2016 1 commit
  23. 02 Mar, 2016 1 commit
  24. 17 Feb, 2016 1 commit
  25. 10 Feb, 2016 1 commit
    • Tejun Heo's avatar
      workqueue: handle NUMA_NO_NODE for unbound pool_workqueue lookup · d6e022f1
      Tejun Heo authored
      When looking up the pool_workqueue to use for an unbound workqueue,
      workqueue assumes that the target CPU is always bound to a valid NUMA
      node.  However, currently, when a CPU goes offline, the mapping is
      destroyed and cpu_to_node() returns NUMA_NO_NODE.
      
      This has always been broken but hasn't triggered often enough before
      874bbfe6 ("workqueue: make sure delayed work run in local cpu").
      After the commit, workqueue forcifully assigns the local CPU for
      delayed work items without explicit target CPU to fix a different
      issue.  This widens the window where CPU can go offline while a
      delayed work item is pending causing delayed work items dispatched
      with target CPU set to an already offlined CPU.  The resulting
      NUMA_NO_NODE mapping makes workqueue try to queue the work item on a
      NULL pool_workqueue and thus crash.
      
      While 874bbfe6
      
       has been reverted for a different reason making the
      bug less visible again, it can still happen.  Fix it by mapping
      NUMA_NO_NODE to the default pool_workqueue from unbound_pwq_by_node().
      This is a temporary workaround.  The long term solution is keeping CPU
      -> NODE mapping stable across CPU off/online cycles which is being
      worked on.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarMike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Rafael J. Wysocki <rafael@kernel.org>
      Cc: Len Brown <len.brown@intel.com>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/g/1454424264.11183.46.camel@gmail.com
      Link: http://lkml.kernel.org/g/1453702100-2597-1-git-send-email-tangchen@cn.fujitsu.com
      d6e022f1
  26. 09 Feb, 2016 3 commits
    • Tejun Heo's avatar
      workqueue: implement "workqueue.debug_force_rr_cpu" debug feature · f303fccb
      Tejun Heo authored
      Workqueue used to guarantee local execution for work items queued
      without explicit target CPU.  The guarantee is gone now which can
      break some usages in subtle ways.  To flush out those cases, this
      patch implements a debug feature which forces round-robin CPU
      selection for all such work items.
      
      The debug feature defaults to off and can be enabled with a kernel
      parameter.  The default can be flipped with a debug config option.
      
      If you hit this commit during bisection, please refer to 041bd12e
      
      
      ("Revert "workqueue: make sure delayed work run in local cpu"") for
      more information and ping me.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      f303fccb
    • Mike Galbraith's avatar
      workqueue: schedule WORK_CPU_UNBOUND work on wq_unbound_cpumask CPUs · ef557180
      Mike Galbraith authored
      
      
      WORK_CPU_UNBOUND work items queued to a bound workqueue always run
      locally.  This is a good thing normally, but not when the user has
      asked us to keep unbound work away from certain CPUs.  Round robin
      these to wq_unbound_cpumask CPUs instead, as perturbation avoidance
      trumps performance.
      
      tj: Cosmetic and comment changes.  WARN_ON_ONCE() dropped from empty
          (wq_unbound_cpumask AND cpu_online_mask).  If we want that, it
          should be done when config changes.
      Signed-off-by: default avatarMike Galbraith <umgwanakikbuti@gmail.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      ef557180
    • Tejun Heo's avatar
      Revert "workqueue: make sure delayed work run in local cpu" · 041bd12e
      Tejun Heo authored
      This reverts commit 874bbfe6.
      
      Workqueue used to implicity guarantee that work items queued without
      explicit CPU specified are put on the local CPU.  Recent changes in
      timer broke the guarantee and led to vmstat breakage which was fixed
      by 176bed1d ("vmstat: explicitly schedule per-cpu work on the CPU
      we need it to run on").
      
      vmstat is the most likely to expose the issue and it's quite possible
      that there are other similar problems which are a lot more difficult
      to trigger.  As a preventive measure, 874bbfe6 ("workqueue: make
      sure delayed work run in local cpu") was applied to restore the local
      CPU guarnatee.  Unfortunately, the change exposed a bug in timer code
      which got fixed by 22b886dd ("timers: Use proper base migration in
      add_timer_on()").  Due to code restructuring, the commit couldn't be
      backported beyond certain point and stable kernels which only had
      874bbfe6 started crashing.
      
      The local CPU guarantee was accidental more than anything else and we
      want to get rid of it anyway.  As, with the vmstat case fixed,
      874bbfe6
      
       is causing more problems than it's fixing, it has been
      decided to take the chance and officially break the guarantee by
      reverting the commit.  A debug feature will be added to force foreign
      CPU assignment to expose cases relying on the guarantee and fixes for
      the individual cases will be backported to stable as necessary.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Fixes: 874bbfe6 ("workqueue: make sure delayed work run in local cpu")
      Link: http://lkml.kernel.org/g/20160120211926.GJ10810@quack.suse.cz
      Cc: stable@vger.kernel.org
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
      Cc: Daniel Bilik <daniel.bilik@neosystem.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Daniel Bilik <daniel.bilik@neosystem.cz>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      041bd12e
  27. 29 Jan, 2016 1 commit
    • Tejun Heo's avatar
      workqueue: skip flush dependency checks for legacy workqueues · 23d11a58
      Tejun Heo authored
      fca839c0
      
       ("workqueue: warn if memory reclaim tries to flush
      !WQ_MEM_RECLAIM workqueue") implemented flush dependency warning which
      triggers if a PF_MEMALLOC task or WQ_MEM_RECLAIM workqueue tries to
      flush a !WQ_MEM_RECLAIM workquee.
      
      This assumes that workqueues marked with WQ_MEM_RECLAIM sit in memory
      reclaim path and making it depend on something which may need more
      memory to make forward progress can lead to deadlocks.  Unfortunately,
      workqueues created with the legacy create*_workqueue() interface
      always have WQ_MEM_RECLAIM regardless of whether they are depended
      upon memory reclaim or not.  These spurious WQ_MEM_RECLAIM markings
      cause spurious triggering of the flush dependency checks.
      
        WARNING: CPU: 0 PID: 6 at kernel/workqueue.c:2361 check_flush_dependency+0x138/0x144()
        workqueue: WQ_MEM_RECLAIM deferwq:deferred_probe_work_func is flushing !WQ_MEM_RECLAIM events:lru_add_drain_per_cpu
        ...
        Workqueue: deferwq deferred_probe_work_func
        [<c0017acc>] (unwind_backtrace) from [<c0013134>] (show_stack+0x10/0x14)
        [<c0013134>] (show_stack) from [<c0245f18>] (dump_stack+0x94/0xd4)
        [<c0245f18>] (dump_stack) from [<c0026f9c>] (warn_slowpath_common+0x80/0xb0)
        [<c0026f9c>] (warn_slowpath_common) from [<c0026ffc>] (warn_slowpath_fmt+0x30/0x40)
        [<c0026ffc>] (warn_slowpath_fmt) from [<c00390b8>] (check_flush_dependency+0x138/0x144)
        [<c00390b8>] (check_flush_dependency) from [<c0039ca0>] (flush_work+0x50/0x15c)
        [<c0039ca0>] (flush_work) from [<c00c51b0>] (lru_add_drain_all+0x130/0x180)
        [<c00c51b0>] (lru_add_drain_all) from [<c00f728c>] (migrate_prep+0x8/0x10)
        [<c00f728c>] (migrate_prep) from [<c00bfbc4>] (alloc_contig_range+0xd8/0x338)
        [<c00bfbc4>] (alloc_contig_range) from [<c00f8f18>] (cma_alloc+0xe0/0x1ac)
        [<c00f8f18>] (cma_alloc) from [<c001cac4>] (__alloc_from_contiguous+0x38/0xd8)
        [<c001cac4>] (__alloc_from_contiguous) from [<c001ceb4>] (__dma_alloc+0x240/0x278)
        [<c001ceb4>] (__dma_alloc) from [<c001cf78>] (arm_dma_alloc+0x54/0x5c)
        [<c001cf78>] (arm_dma_alloc) from [<c0355ea4>] (dmam_alloc_coherent+0xc0/0xec)
        [<c0355ea4>] (dmam_alloc_coherent) from [<c039cc4c>] (ahci_port_start+0x150/0x1dc)
        [<c039cc4c>] (ahci_port_start) from [<c0384734>] (ata_host_start.part.3+0xc8/0x1c8)
        [<c0384734>] (ata_host_start.part.3) from [<c03898dc>] (ata_host_activate+0x50/0x148)
        [<c03898dc>] (ata_host_activate) from [<c039d558>] (ahci_host_activate+0x44/0x114)
        [<c039d558>] (ahci_host_activate) from [<c039f05c>] (ahci_platform_init_host+0x1d8/0x3c8)
        [<c039f05c>] (ahci_platform_init_host) from [<c039e6bc>] (tegra_ahci_probe+0x448/0x4e8)
        [<c039e6bc>] (tegra_ahci_probe) from [<c0347058>] (platform_drv_probe+0x50/0xac)
        [<c0347058>] (platform_drv_probe) from [<c03458cc>] (driver_probe_device+0x214/0x2c0)
        [<c03458cc>] (driver_probe_device) from [<c0343cc0>] (bus_for_each_drv+0x60/0x94)
        [<c0343cc0>] (bus_for_each_drv) from [<c03455d8>] (__device_attach+0xb0/0x114)
        [<c03455d8>] (__device_attach) from [<c0344ab8>] (bus_probe_device+0x84/0x8c)
        [<c0344ab8>] (bus_probe_device) from [<c0344f48>] (deferred_probe_work_func+0x68/0x98)
        [<c0344f48>] (deferred_probe_work_func) from [<c003b738>] (process_one_work+0x120/0x3f8)
        [<c003b738>] (process_one_work) from [<c003ba48>] (worker_thread+0x38/0x55c)
        [<c003ba48>] (worker_thread) from [<c0040f14>] (kthread+0xdc/0xf4)
        [<c0040f14>] (kthread) from [<c000f778>] (ret_from_fork+0x14/0x3c)
      
      Fix it by marking workqueues created via create*_workqueue() with
      __WQ_LEGACY and disabling flush dependency checks on them.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-and-tested-by: default avatarThierry Reding <thierry.reding@gmail.com>
      Link: http://lkml.kernel.org/g/20160126173843.GA11115@ulmo.nvidia.com
      Fixes: fca839c0 ("workqueue: warn if memory reclaim tries to flush !WQ_MEM_RECLAIM workqueue")
      23d11a58
  28. 07 Jan, 2016 1 commit
  29. 08 Dec, 2015 2 commits
    • Tejun Heo's avatar
      workqueue: implement lockup detector · 82607adc
      Tejun Heo authored
      
      
      Workqueue stalls can happen from a variety of usage bugs such as
      missing WQ_MEM_RECLAIM flag or concurrency managed work item
      indefinitely staying RUNNING.  These stalls can be extremely difficult
      to hunt down because the usual warning mechanisms can't detect
      workqueue stalls and the internal state is pretty opaque.
      
      To alleviate the situation, this patch implements workqueue lockup
      detector.  It periodically monitors all worker_pools periodically and,
      if any pool failed to make forward progress longer than the threshold
      duration, triggers warning and dumps workqueue state as follows.
      
       BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 31s!
       Showing busy workqueues and worker pools:
       workqueue events: flags=0x0
         pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=17/256
           pending: monkey_wrench_fn, e1000_watchdog, cache_reap, vmstat_shepherd, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, cgroup_release_agent
       workqueue events_power_efficient: flags=0x80
         pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
           pending: check_lifetime, neigh_periodic_work
       workqueue cgroup_pidlist_destroy: flags=0x0
         pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
           pending: cgroup_pidlist_destroy_work_fn
       ...
      
      The detection mechanism is controller through kernel parameter
      workqueue.watchdog_thresh and can be updated at runtime through the
      sysfs module parameter file.
      
      v2: Decoupled from softlockup control knobs.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarDon Zickus <dzickus@redhat.com>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Chris Mason <clm@fb.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      82607adc
    • Tejun Heo's avatar
      workqueue: warn if memory reclaim tries to flush !WQ_MEM_RECLAIM workqueue · fca839c0
      Tejun Heo authored
      
      
      Task or work item involved in memory reclaim trying to flush a
      non-WQ_MEM_RECLAIM workqueue or one of its work items can lead to
      deadlock.  Trigger WARN_ONCE() if such conditions are detected.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      fca839c0
  30. 12 Oct, 2015 1 commit
  31. 30 Sep, 2015 1 commit
    • Shaohua Li's avatar
      workqueue: make sure delayed work run in local cpu · 874bbfe6
      Shaohua Li authored
      
      
      My system keeps crashing with below message. vmstat_update() schedules a delayed
      work in current cpu and expects the work runs in the cpu.
      schedule_delayed_work() is expected to make delayed work run in local cpu. The
      problem is timer can be migrated with NO_HZ. __queue_work() queues work in
      timer handler, which could run in a different cpu other than where the delayed
      work is scheduled. The end result is the delayed work runs in different cpu.
      The patch makes __queue_delayed_work records local cpu earlier. Where the timer
      runs doesn't change where the work runs with the change.
      
      [   28.010131] ------------[ cut here ]------------
      [   28.010609] kernel BUG at ../mm/vmstat.c:1392!
      [   28.011099] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
      [   28.011860] Modules linked in:
      [   28.012245] CPU: 0 PID: 289 Comm: kworker/0:3 Tainted: G        W4.3.0-rc3+ #634
      [   28.013065] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153802- 04/01/2014
      [   28.014160] Workqueue: events vmstat_update
      [   28.014571] task: ffff880117682580 ti: ffff8800ba428000 task.ti: ffff8800ba428000
      [   28.015445] RIP: 0010:[<ffffffff8115f921>]  [<ffffffff8115f921>]vmstat_update+0x31/0x80
      [   28.016282] RSP: 0018:ffff8800ba42fd80  EFLAGS: 00010297
      [   28.016812] RAX: 0000000000000000 RBX: ffff88011a858dc0 RCX:0000000000000000
      [   28.017585] RDX: ffff880117682580 RSI: ffffffff81f14d8c RDI:ffffffff81f4df8d
      [   28.018366] RBP: ffff8800ba42fd90 R08: 0000000000000001 R09:0000000000000000
      [   28.019169] R10: 0000000000000000 R11: 0000000000000121 R12:ffff8800baa9f640
      [   28.019947] R13: ffff88011a81e340 R14: ffff88011a823700 R15:0000000000000000
      [   28.020071] FS:  0000000000000000(0000) GS:ffff88011a800000(0000)knlGS:0000000000000000
      [   28.020071] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [   28.020071] CR2: 00007ff6144b01d0 CR3: 00000000b8e93000 CR4:00000000000006f0
      [   28.020071] Stack:
      [   28.020071]  ffff88011a858dc0 ffff8800baa9f640 ffff8800ba42fe00ffffffff8106bd88
      [   28.020071]  ffffffff8106bd0b 0000000000000096 0000000000000000ffffffff82f9b1e8
      [   28.020071]  ffffffff829f0b10 0000000000000000 ffffffff81f18460ffff88011a81e340
      [   28.020071] Call Trace:
      [   28.020071]  [<ffffffff8106bd88>] process_one_work+0x1c8/0x540
      [   28.020071]  [<ffffffff8106bd0b>] ? process_one_work+0x14b/0x540
      [   28.020071]  [<ffffffff8106c214>] worker_thread+0x114/0x460
      [   28.020071]  [<ffffffff8106c100>] ? process_one_work+0x540/0x540
      [   28.020071]  [<ffffffff81071bf8>] kthread+0xf8/0x110
      [   28.020071]  [<ffffffff81071b00>] ?kthread_create_on_node+0x200/0x200
      [   28.020071]  [<ffffffff81a6522f>] ret_from_fork+0x3f/0x70
      [   28.020071]  [<ffffffff81071b00>] ?kthread_create_on_node+0x200/0x200
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org # v2.6.31+
      874bbfe6
  32. 12 Aug, 2015 1 commit
    • Peter Zijlstra's avatar
      sched: Fix a race between __kthread_bind() and sched_setaffinity() · 25834c73
      Peter Zijlstra authored
      
      
      Because sched_setscheduler() checks p->flags & PF_NO_SETAFFINITY
      without locks, a caller might observe an old value and race with the
      set_cpus_allowed_ptr() call from __kthread_bind() and effectively undo
      it:
      
      	__kthread_bind()
      	  do_set_cpus_allowed()
      						<SYSCALL>
      						  sched_setaffinity()
      						    if (p->flags & PF_NO_SETAFFINITIY)
      						    set_cpus_allowed_ptr()
      	  p->flags |= PF_NO_SETAFFINITY
      
      Fix the bug by putting everything under the regular scheduler locks.
      
      This also closes a hole in the serialization of task_struct::{nr_,}cpus_allowed.
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dedekind1@gmail.com
      Cc: juri.lelli@arm.com
      Cc: mgorman@suse.de
      Cc: riel@redhat.com
      Cc: rostedt@goodmis.org
      Link: http://lkml.kernel.org/r/20150515154833.545640346@infradead.org
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      25834c73
  33. 04 Aug, 2015 1 commit
  34. 22 Jul, 2015 1 commit