  1. Jan 21, 2019
    • locking/rwsem: Fix (possible) missed wakeup · e158488b
      Xie Yongji authored
      
      Because wake_q_add() can imply an immediate wakeup (cmpxchg failure
      case), we must not rely on the wakeup being delayed. However, commit:
      
        e3851390 ("locking/rwsem: Rework zeroing reader waiter->task")
      
      relies on exactly that behaviour in that the wakeup must not happen
      until after we clear waiter->task.
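
      The fix is to hold a task reference and clear waiter->task before
      queueing the wakeup; roughly (a sketch of the reworked
      __rwsem_mark_wake() ordering, not the verbatim diff):

        get_task_struct(tsk);
        /* Hold a reference so a concurrent do_exit() cannot free the task. */
        smp_store_release(&waiter->task, NULL);
        /* Only now may the wakeup fire, from us or a concurrent waker. */
        wake_q_add(wake_q, tsk);
        put_task_struct(tsk);	/* wake_q_add() takes its own reference */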
      
      [ peterz: Added changelog. ]
      
      Signed-off-by: Xie Yongji <xieyongji@baidu.com>
      Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: e3851390 ("locking/rwsem: Rework zeroing reader waiter->task")
      Link: https://lkml.kernel.org/r/1543495830-2644-1-git-send-email-xieyongji@baidu.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • futex: Fix (possible) missed wakeup · b061c38b
      Peter Zijlstra authored
      
      We must not rely on wake_q_add() to delay the wakeup; in particular
      commit:
      
        1d0dcb3a ("futex: Implement lockless wakeups")
      
      moved wake_q_add() before smp_store_release(&q->lock_ptr, NULL), which
      could result in futex_wait() waking before observing ->lock_ptr ==
      NULL and going back to sleep again.
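
      With the fix, mark_wake_futex() publishes ->lock_ptr == NULL before the
      wakeup can possibly fire; roughly (a sketch, with the hash-bucket
      locking around it elided):

        get_task_struct(p);
        __unqueue_futex(q);
        /* Once lock_ptr is NULL, the waiter may free the futex_q. */
        smp_store_release(&q->lock_ptr, NULL);
        /* Queue the wakeup only after ->lock_ptr is visibly NULL. */
        wake_q_add(wake_q, p);
        put_task_struct(p);	/* wake_q_add() holds its own reference */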
      
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 1d0dcb3a ("futex: Implement lockless wakeups")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/wake_q: Fix wakeup ordering for wake_q · 4c4e3731
      Peter Zijlstra authored
      
      Notably, cmpxchg() does not provide ordering when it fails; however,
      wake_q_add() requires ordering in this specific case too. Without this
      it would be possible for the concurrent wakeup to not observe our
      prior state.
      
      Andrea Parri provided:
      
        C wake_up_q-wake_q_add
      
        {
      	int next = 0;
      	int y = 0;
        }
      
        P0(int *next, int *y)
        {
      	int r0;
      
      	/* in wake_up_q() */
      
      	WRITE_ONCE(*next, 1);   /* node->next = NULL */
      	smp_mb();               /* implied by wake_up_process() */
      	r0 = READ_ONCE(*y);
        }
      
        P1(int *next, int *y)
        {
      	int r1;
      
      	/* in wake_q_add() */
      
      	WRITE_ONCE(*y, 1);      /* wake_cond = true */
      	smp_mb__before_atomic();
      	r1 = cmpxchg_relaxed(next, 1, 2);
        }
      
        exists (0:r0=0 /\ 1:r1=0)
      
        This "exists" clause cannot be satisfied according to the LKMM:
      
        Test wake_up_q-wake_q_add Allowed
        States 3
        0:r0=0; 1:r1=1;
        0:r0=1; 1:r1=0;
        0:r0=1; 1:r1=1;
        No
        Witnesses
        Positive: 0 Negative: 3
        Condition exists (0:r0=0 /\ 1:r1=0)
        Observation wake_up_q-wake_q_add Never 0 3
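
      Concretely, the fix replaces the bare cmpxchg() in wake_q_add() with an
      explicit barrier plus a relaxed cmpxchg, as modelled by P1 above (a
      sketch of the changed lines):

        /*
         * Ensure that even a failed cmpxchg() lets the concurrent
         * wake_up_q() observe our prior state.
         */
        smp_mb__before_atomic();
        if (cmpxchg_relaxed(&node->next, NULL, WAKE_Q_TAIL))
        	return;	/* already queued; the pending wakeup sees our state */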
      
      Reported-by: Yongji Xie <elohimes@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/wake_q: Document wake_q_add() · e6018c0f
      Peter Zijlstra authored
      
      The only guarantee provided by wake_q_add() is that a wakeup will
      happen after it; it does _NOT_ guarantee that the wakeup will be
      delayed until the matching wake_up_q().

      If wake_q_add() fails the cmpxchg(), a concurrent wakeup is pending and
      can happen at any time after the cmpxchg(). This means we should not
      rely on the wakeup happening at wake_up_q(), but should be ready for
      wake_q_add() to issue the wakeup itself.

      The delay, if provided (most likely), should only result in more
      efficient behaviour.
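
      For callers this means: publish all state the wakee will check before
      calling wake_q_add(). A hedged usage sketch (wake_cond and task are
      illustrative names, not kernel API):

        wake_cond = true;		/* state the wakee will check */
        wake_q_add(&wake_q, task);	/* may issue the wakeup right here */
        ...
        wake_up_q(&wake_q);		/* usual, but not guaranteed, wakeup point */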
      
      Reported-by: Yongji Xie <elohimes@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/wait: Fix rcuwait_wake_up() ordering · 6dc080ee
      Prateek Sood authored
      
      For some peculiar reason rcuwait_wake_up() has the right barrier in
      the comment, but not in the code.
      
      This mistake has been observed to cause a deadlock in the following
      situation:
      
          P1					P2
      
          percpu_up_read()			percpu_down_write()
            rcu_sync_is_idle() // false
      					  rcu_sync_enter()
      					  ...
            __percpu_up_read()
      
      [S] ,-  __this_cpu_dec(*sem->read_count)
          |   smp_rmb();
      [L] |   task = rcu_dereference(w->task) // NULL
          |
          |				    [S]	    w->task = current
          |					    smp_mb();
          |				    [L]	    readers_active_check() // fail
          `-> <store happens here>
      
      Where the smp_rmb() (obviously) fails to constrain the store.
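
      The fix upgrades the barrier to a full smp_mb(); the resulting function
      is roughly (a sketch):

        void rcuwait_wake_up(struct rcuwait *w)
        {
        	struct task_struct *task;

        	rcu_read_lock();
        	/*
        	 * Order the condition store against the ->task load; pairs
        	 * with the full barrier implied by set_current_state() on
        	 * the wait side. An smp_rmb() cannot order the prior store.
        	 */
        	smp_mb();
        	task = rcu_dereference(w->task);
        	if (task)
        		wake_up_process(task);
        	rcu_read_unlock();
        }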
      
      [ peterz: Added changelog. ]
      
      Signed-off-by: Prateek Sood <prsood@codeaurora.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
      Acked-by: Davidlohr Bueso <dbueso@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 8f95c90c ("sched/wait, RCU: Introduce rcuwait machinery")
      Link: https://lkml.kernel.org/r/1543590656-7157-1-git-send-email-prsood@codeaurora.org
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  2. Jan 18, 2019
    • bpf: fix inner map masking to prevent oob under speculation · 9d5564dd
      Daniel Borkmann authored
      
      During review I noticed that the inner meta map setup for map-in-map
      is buggy in that it does not propagate all the data from the reference
      map that the verifier later accesses.

      In particular, one such case is the index masking used to prevent
      out-of-bounds access under speculative execution, which is broken by
      the missing propagation of the map's unpriv_array/index_mask fields.
      Fix this such that the verifier generates the correct code for inlined
      lookups in case of unprivileged use; a sketch of the propagation
      follows the dumps below.
      
      Before patch (test_verifier's 'map in map access' dump):
      
        # bpftool prog dump xla id 3
           0: (62) *(u32 *)(r10 -4) = 0
           1: (bf) r2 = r10
           2: (07) r2 += -4
           3: (18) r1 = map[id:4]
           5: (07) r1 += 272                |
           6: (61) r0 = *(u32 *)(r2 +0)     |
           7: (35) if r0 >= 0x1 goto pc+6   | Inlined map in map lookup
           8: (54) (u32) r0 &= (u32) 0      | with index masking for
           9: (67) r0 <<= 3                 | map->unpriv_array.
          10: (0f) r0 += r1                 |
          11: (79) r0 = *(u64 *)(r0 +0)     |
          12: (15) if r0 == 0x0 goto pc+1   |
          13: (05) goto pc+1                |
          14: (b7) r0 = 0                   |
          15: (15) if r0 == 0x0 goto pc+11
          16: (62) *(u32 *)(r10 -4) = 0
          17: (bf) r2 = r10
          18: (07) r2 += -4
          19: (bf) r1 = r0
          20: (07) r1 += 272                |
          21: (61) r0 = *(u32 *)(r2 +0)     | Index masking missing (!)
          22: (35) if r0 >= 0x1 goto pc+3   | for inner map despite
          23: (67) r0 <<= 3                 | map->unpriv_array set.
          24: (0f) r0 += r1                 |
          25: (05) goto pc+1                |
          26: (b7) r0 = 0                   |
          27: (b7) r0 = 0
          28: (95) exit
      
      After patch:
      
        # bpftool prog dump xla id 1
           0: (62) *(u32 *)(r10 -4) = 0
           1: (bf) r2 = r10
           2: (07) r2 += -4
           3: (18) r1 = map[id:2]
           5: (07) r1 += 272                |
           6: (61) r0 = *(u32 *)(r2 +0)     |
           7: (35) if r0 >= 0x1 goto pc+6   | Same inlined map in map lookup
           8: (54) (u32) r0 &= (u32) 0      | with index masking due to
           9: (67) r0 <<= 3                 | map->unpriv_array.
          10: (0f) r0 += r1                 |
          11: (79) r0 = *(u64 *)(r0 +0)     |
          12: (15) if r0 == 0x0 goto pc+1   |
          13: (05) goto pc+1                |
          14: (b7) r0 = 0                   |
          15: (15) if r0 == 0x0 goto pc+12
          16: (62) *(u32 *)(r10 -4) = 0
          17: (bf) r2 = r10
          18: (07) r2 += -4
          19: (bf) r1 = r0
          20: (07) r1 += 272                |
          21: (61) r0 = *(u32 *)(r2 +0)     |
          22: (35) if r0 >= 0x1 goto pc+4   | Now fixed inlined inner map
          23: (54) (u32) r0 &= (u32) 0      | lookup with proper index masking
          24: (67) r0 <<= 3                 | for map->unpriv_array.
          25: (0f) r0 += r1                 |
          26: (05) goto pc+1                |
          27: (b7) r0 = 0                   |
          28: (b7) r0 = 0
          29: (95) exit
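
      The fix itself boils down to propagating the masking state when the
      inner map meta is created (a sketch, using the field names from the
      map-in-map code):

        /* In bpf_map_meta_alloc(): carry over the masking state so the
         * verifier emits index masking for inlined inner-map lookups too. */
        inner_map_meta->unpriv_array = inner_map->unpriv_array;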
      
      Fixes: b2157399 ("bpf: prevent out-of-bounds speculation")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  3. Jan 15, 2019
    • seccomp: fix UAF in user-trap code · a811dc61
      Tycho Andersen authored
      
      On the failure path, we do an fput() of the listener fd if the filter
      fails to install (e.g. because of a TSYNC race that's lost, or if the
      thread is killed, etc.). fput() doesn't actually release the fd; it
      just adds the release to a work queue. The thread then proceeds to
      free the filter, even though the listener struct file still holds a
      reference to it.

      To fix this, on the failure path set the private data to NULL, so that
      ->release() knows to ignore the filter; see the sketch below.
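
      A sketch of the idea (error-path details elided):

        if (ret < 0) {
        	/* Tell ->release() that the filter is going away with us. */
        	listener_f->private_data = NULL;
        	fput(listener_f);
        	put_unused_fd(listener);
        }

        /* ...and in the notifier's ->release() handler: */
        struct seccomp_filter *filter = file->private_data;

        if (!filter)
        	return 0;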
      
      Reported-by: <syzbot+981c26489b2d1c6316ba@syzkaller.appspotmail.com>
      Fixes: 6a21cc50 ("seccomp: add a return code to trap to userspace")
      Signed-off-by: Tycho Andersen <tycho@tycho.ws>
      Acked-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: James Morris <james.morris@microsoft.com>
    • tracing/kprobes: Fix NULL pointer dereference in trace_kprobe_create() · 8b05a3a7
      Andrea Righi authored

      It is possible to trigger a NULL pointer dereference by writing an
      incorrectly formatted string to kprobe_events (trying to create a
      kretprobe while omitting the symbol).
      
      Example:
      
       echo "r:event_1 " >> /sys/kernel/debug/tracing/kprobe_events
      
      That triggers this:
      
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
       #PF error: [normal kernel read fault]
       PGD 0 P4D 0
       Oops: 0000 [#1] SMP PTI
       CPU: 6 PID: 1757 Comm: bash Not tainted 5.0.0-rc1+ #125
       Hardware name: Dell Inc. XPS 13 9370/0F6P3V, BIOS 1.5.1 08/09/2018
       RIP: 0010:kstrtoull+0x2/0x20
       Code: 28 00 00 00 75 17 48 83 c4 18 5b 41 5c 5d c3 b8 ea ff ff ff eb e1 b8 de ff ff ff eb da e8 d6 36 bb ff 66 0f 1f 44 00 00 31 c0 <80> 3f 2b 55 48 89 e5 0f 94 c0 48 01 c7 e8 5c ff ff ff 5d c3 66 2e
       RSP: 0018:ffffb5d482e57cb8 EFLAGS: 00010246
       RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffff82b12720
       RDX: ffffb5d482e57cf8 RSI: 0000000000000000 RDI: 0000000000000000
       RBP: ffffb5d482e57d70 R08: ffffa0c05e5a7080 R09: ffffa0c05e003980
       R10: 0000000000000000 R11: 0000000040000000 R12: ffffa0c04fe87b08
       R13: 0000000000000001 R14: 000000000000000b R15: ffffa0c058d749e1
       FS:  00007f137c7f7740(0000) GS:ffffa0c05e580000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000000000 CR3: 0000000497d46004 CR4: 00000000003606e0
       Call Trace:
        ? trace_kprobe_create+0xb6/0x840
        ? _cond_resched+0x19/0x40
        ? _cond_resched+0x19/0x40
        ? __kmalloc+0x62/0x210
        ? argv_split+0x8f/0x140
        ? trace_kprobe_create+0x840/0x840
        ? trace_kprobe_create+0x840/0x840
        create_or_delete_trace_kprobe+0x11/0x30
        trace_run_command+0x50/0x90
        trace_parse_run_command+0xc1/0x160
        probes_write+0x10/0x20
        __vfs_write+0x3a/0x1b0
        ? apparmor_file_permission+0x1a/0x20
        ? security_file_permission+0x31/0xf0
        ? _cond_resched+0x19/0x40
        vfs_write+0xb1/0x1a0
        ksys_write+0x55/0xc0
        __x64_sys_write+0x1a/0x20
        do_syscall_64+0x5a/0x120
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fix by doing the proper argument checks in trace_kprobe_create().
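
      Illustratively, the check amounts to rejecting a kretprobe definition
      whose symbol/address argument is missing, before any numeric parsing
      runs (a hypothetical sketch, not the exact diff):

        /* "r:event_1 " splits into a single argument: no symbol follows. */
        if (argc < 2 || !argv[1] || !argv[1][0])
        	return -ECANCELED;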
      
      Cc: Ingo Molnar <mingo@redhat.com>
      Link: https://lore.kernel.org/lkml/20190111095108.b79a2ee026185cbd62365977@kernel.org
      Link: http://lkml.kernel.org/r/20190111060113.GA22841@xps-13
      
      
      Fixes: 6212dd29 ("tracing/kprobes: Use dyn_event framework for kprobe events")
      Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
      Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • posix-cpu-timers: Unbreak timer rearming · 93ad0fc0
      Thomas Gleixner authored
      
      The recent commit which prevented a division by 0 issue in the alarm timer
      code broke posix CPU timers as an unwanted side effect.
      
      The reason is that the common rearm code checks for timer->it_interval
      being 0 now. What went unnoticed is that the posix cpu timer setup does not
      initialize timer->it_interval as it stores the interval in CPU timer
      specific storage. The reason for the separate storage is historical as the
      posix CPU timers always had a 64bit nanoseconds representation internally
      while timer->it_interval is type ktime_t which used to be a modified
      timespec representation on 32bit machines.
      
      Instead of reverting the offending commit and fixing the alarmtimer issue
      in the alarmtimer code, store the interval in timer->it_interval at CPU
      timer setup time so the common code check works. This also repairs the
      existing inconsistency of the posix CPU timer code, which kept a single
      shot timer armed despite the interval being 0.
      
      The separate storage can be removed in mainline, but that needs to be a
      separate commit as the current one has to be backported to stable kernels.
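
      A sketch of the change in posix_cpu_timer_set(), mirroring the
      CPU-timer interval into the common field:

        timer->it.cpu.incr = timespec64_to_ns(&new->it_interval);
        /* Let the common rearm code's it_interval check see the interval. */
        timer->it_interval = ns_to_ktime(timer->it.cpu.incr);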
      
      Fixes: 0e334db6 ("posix-timers: Fix division by zero bug")
      Reported-by: H.J. Lu <hjl.tools@gmail.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190111133500.840117406@linutronix.de
    • genirq: Make sure the initial affinity is not empty · bddda606
      Srinivas Ramana authored
      
      If all CPUs in the irq_default_affinity mask are offline when an interrupt
      is initialized then irq_setup_affinity() can set an empty affinity mask for
      a newly allocated interrupt.
      
      Fix this by falling back to cpu_online_mask in case the resulting affinity
      mask is zero.
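
      A sketch of the fallback in irq_setup_affinity():

        cpumask_and(&mask, cpu_online_mask, set);
        if (cpumask_empty(&mask))
        	/* All CPUs of the default affinity are offline. */
        	cpumask_copy(&mask, cpu_online_mask);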
      
      Signed-off-by: Srinivas Ramana <sramana@codeaurora.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-arm-msm@vger.kernel.org
      Link: https://lkml.kernel.org/r/1545312957-8504-1-git-send-email-sramana@codeaurora.org
  4. Jan 13, 2019
    • kernel/sys.c: Clarify that UNAME26 does not generate unique versions anymore · b7285b42
      Jonathan Neuschäfer authored
      
      UNAME26 is a mechanism to report Linux's version as 2.6.x, for
      compatibility with old/broken software.  Due to the way it is
      implemented, it would have to be updated after 5.0, to keep the
      resulting versions unique.  Linus Torvalds argued:
      
       "Do we actually need this?
      
        I'd rather let it bitrot, and just let it return random versions. It
        will just start again at 2.4.60, won't it?
      
        Anybody who uses UNAME26 for a 5.x kernel might as well think it's
        still 4.x. The user space is so old that it can't possibly care about
        differences between 4.x and 5.x, can it?
      
        The only thing that matters is that it shows "2.4.<largeenough>",
        which it will do regardless"
      
      Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. Jan 12, 2019
    • umh: add exit routine for UMH process · 73ab1cb2
      Taehee Yoo authored
      
      A UMH process created by fork_usermode_blob(), such as bpfilter, needs
      to release the members of its umh_info when the process terminates.
      But do_exit() does not release the members of the umh_info, so a module
      that uses UMH needs its own code to detect whether the UMH process has
      terminated. That means extra code for checking the status of the UMH
      process, which eventually makes the code more complex.

      A new PF_UMH flag is added and used to identify UMH processes. The
      exit_umh() hook does not itself release the members of the umh_info;
      the umh_info->cleanup callback should therefore release both the
      members of the umh_info and the private data. A sketch of the hook
      follows.
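
      A sketch of the exit hook (assuming the umh_list bookkeeping named
      here):

        /* Invoked from do_exit() for tasks flagged PF_UMH. */
        void __exit_umh(struct task_struct *tsk)
        {
        	struct umh_info *info;

        	mutex_lock(&umh_list_lock);
        	list_for_each_entry(info, &umh_list, list) {
        		if (info->pid == tsk->pid) {
        			list_del(&info->list);
        			mutex_unlock(&umh_list_lock);
        			/* The module frees umh_info and private data. */
        			info->cleanup(info);
        			return;
        		}
        	}
        	mutex_unlock(&umh_list_lock);
        }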
      
      Suggested-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. Jan 11, 2019
    • bpf: fix bpffs bitfield pretty print · 17e3ac81
      Yonghong Song authored
      
      Commit 9d5f9f70 ("bpf: btf: fix struct/union/fwd types
      with kind_flag") introduced kind_flag and used bitfield_size
      in the btf_member to directly pretty print member values.
      
      The commit contained a bug where the incorrect parameters could be
      passed to function btf_bitfield_seq_show(). The bits_offset
      parameter in the function expects a value less than 8.
      Instead, the member offset in the structure is passed.
      
      The signature of btf_bitfield_seq_show() is:

        void btf_bitfield_seq_show(void *data, u8 bits_offset,
                                   u8 nr_bits, struct seq_file *m)

      Both bits_offset and nr_bits are of type u8, so if the bitfield member
      offset is greater than 256, an incorrect value will be printed.

      This patch fixes the issue by calculating the proper data offset and
      bits_offset, similar to the non-kind_flag case.
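
      A sketch of the corrected caller (helper macros as in the BTF code):

        /* Split the in-struct bit offset into a byte-aligned pointer
         * and a sub-byte (< 8) offset, as the non-kind_flag path does. */
        data += BITS_ROUNDDOWN_BYTES(struct_bits_off);
        bits8_offset = BITS_PER_BYTE_MASKED(struct_bits_off);
        btf_bitfield_seq_show(data, bits8_offset, nr_bits, m);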
      
      Fixes: 9d5f9f70 ("bpf: btf: fix struct/union/fwd types with kind_flag")
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  7. Jan 08, 2019
    • fork: record start_time late · 7b558513
      David Herrmann authored
      
      This changes the fork(2) syscall to record the process start_time after
      initializing the basic task structure but still before making the new
      process visible to user-space.
      
      Technically, we could record the start_time anytime during fork(2).  But
      this might lead to scenarios where a start_time is recorded long before
      a process becomes visible to user-space.  For instance, with
      userfaultfd(2) and TLS, user-space can delay the execution of fork(2)
      for an indefinite amount of time (and will, if this causes network
      access, or similar).
      
      By recording the start_time late, it much closer reflects the point in
      time where the process becomes live and can be observed by other
      processes.
      
      Lastly, this makes it much harder for user-space to predict and control
      the start_time they get assigned.  Previously, user-space could fork a
      process and stall it in copy_thread_tls() before its pid is allocated,
      but after its start_time is recorded. This can be misused to cycle
      through PIDs later on and resume the stalled fork(2), yielding a
      process that has the same pid and start_time as a process that existed
      before.
      This can be used to circumvent security systems that identify processes
      by their pid+start_time combination.
      
      Even though user-space was always aware that start_time recording is
      flaky (but several projects are known to still rely on start_time-based
      identification), changing the start_time to be recorded late will help
      mitigate existing attacks and make it much harder for user-space to
      control the start_time a process gets assigned.
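
      The change itself is small: the timestamping moves to just before the
      new task becomes visible (a sketch of the copy_process() placement):

        /* Record start_time late, after the long preemptible middle of
         * fork(2), right before the task is attached to the pid hash. */
        p->start_time = ktime_get_ns();
        p->real_start_time = ktime_get_boot_ns();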
      
      Reported-by: Jann Horn <jannh@google.com>
      Signed-off-by: Tom Gundersen <teg@jklm.no>
      Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>