  1. Apr 08, 2021
    • add support for Clang CFI · cf68fffb
      Sami Tolvanen authored
      This change adds support for Clang’s forward-edge Control Flow
      Integrity (CFI) checking. With CONFIG_CFI_CLANG, the compiler
      injects a runtime check before each indirect function call to ensure
      the target is a valid function with the correct static type. This
      restricts possible call targets and makes it more difficult for
      an attacker to exploit bugs that allow the modification of stored
      function pointers. For more details, see:
      
        https://clang.llvm.org/docs/ControlFlowIntegrity.html
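
      For illustration, here is a minimal sketch (plain C, hypothetical
      names) of the kind of indirect-call type mismatch this check is
      meant to catch:

        typedef int (*handler_t)(int);

        static int good_handler(int x)
        {
                return x + 1;             /* matches handler_t exactly */
        }

        static long bad_handler(void *p)  /* different prototype */
        {
                return 0;
        }

        static int dispatch(handler_t h, int arg)
        {
                /* With CONFIG_CFI_CLANG, a compiler-generated check runs
                 * before this indirect call and verifies that h points to
                 * a function whose static type is handler_t.
                 */
                return h(arg);
        }

        /* dispatch(good_handler, 1) passes the check, while
         * dispatch((handler_t)bad_handler, 1) has a mismatched static
         * type and trips the CFI check at runtime.
         */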
      
      
      
      Clang requires CONFIG_LTO_CLANG to be enabled with CFI to gain
      visibility to possible call targets. Kernel modules are supported
      with Clang’s cross-DSO CFI mode, which allows checking between
      independently compiled components.
      
      With CFI enabled, the compiler injects a __cfi_check() function into
      the kernel and each module for validating local call targets. For
      cross-module calls that cannot be validated locally, the compiler
      calls the global __cfi_slowpath_diag() function, which determines
      the target module and calls the correct __cfi_check() function. This
      patch includes a slowpath implementation that uses __module_address()
      to resolve call targets, and with CONFIG_CFI_CLANG_SHADOW enabled, a
      shadow map that speeds up module look-ups by ~3x.
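
      A rough sketch of the slowpath idea (simplified, with assumed field
      and helper names, not the exact kernel/cfi.c code):

        #include <linux/module.h>

        /* Provided by the compiler for the kernel image itself. */
        extern void __cfi_check(uint64_t id, void *ptr, void *diag);

        typedef void (*cfi_check_fn)(uint64_t id, void *ptr, void *diag);

        static cfi_check_fn find_check_fn(unsigned long ptr)
        {
                cfi_check_fn fn = NULL;
                struct module *mod;

                rcu_read_lock_sched();
                mod = __module_address(ptr);    /* which module owns ptr? */
                if (mod)
                        fn = mod->cfi_check;    /* assumed per-module hook */
                rcu_read_unlock_sched();

                /* Fall back to the kernel's own check function. */
                return fn ?: __cfi_check;
        }

        void __cfi_slowpath_diag(uint64_t id, void *ptr, void *diag)
        {
                cfi_check_fn fn = find_check_fn((unsigned long)ptr);

                fn(id, ptr, diag);
        }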
      
      Clang implements indirect call checking using jump tables and
      offers two methods of generating them. With canonical jump tables,
      the compiler renames each address-taken function to <function>.cfi
      and points the original symbol to a jump table entry, which passes
      __cfi_check() validation. This isn’t compatible with stand-alone
      assembly code, which the compiler doesn’t instrument, and would
      cause indirect calls to assembly code to fail. Therefore, we
      default to using non-canonical jump tables instead, where the compiler
      generates a local jump table entry <function>.cfi_jt for each
      address-taken function, and replaces all references to the function
      with the address of the jump table entry.
      
      Note that because non-canonical jump table addresses are local
      to each component, they break cross-module function address
      equality. Specifically, the address of a global function will be
      different in each module, as it's replaced with the address of a local
      jump table entry. If this address is passed to a different module,
      it won’t match the address of the same function taken there. This
      may break code that relies on comparing addresses passed from other
      components.
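
      An illustrative sketch of the caveat (hypothetical names):

        extern void my_func(void);

        static bool is_my_func(void (*fn)(void))
        {
                /* With non-canonical jump tables, &my_func here resolves to
                 * this module's local <my_func>.cfi_jt entry, while fn may
                 * hold another module's entry, so the comparison can be
                 * false even when both refer to the same function.
                 */
                return fn == my_func;
        }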
      
      CFI checking can be disabled in a function with the __nocfi attribute.
      Additionally, CFI can be disabled for an entire compilation unit by
      filtering out CC_FLAGS_CFI.
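
      For example, a single function can be opted out like this (a sketch
      with a hypothetical function name):

        static void __nocfi call_firmware_entry(void (*entry)(void))
        {
                /* No CFI check is emitted for this indirect call. */
                entry();
        }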
      
      By default, CFI failures result in a kernel panic to stop a potential
      exploit. CONFIG_CFI_PERMISSIVE enables a permissive mode, where the
      kernel prints out a rate-limited warning instead, and allows execution
      to continue. This option is helpful for locating type mismatches, but
      should only be enabled during development.
      
      Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Tested-by: Nathan Chancellor <nathan@kernel.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20210408182843.1754385-2-samitolvanen@google.com
  2. Apr 01, 2021
    • tracing: Fix stack trace event size · 9deb193a
      Steven Rostedt (VMware) authored
      Commit cbc3b92c fixed an issue by modifying the macros of the stack
      trace event so that user space could parse it properly. Originally,
      the stack trace format exposed to user space described the caller
      stack as a dynamic array. But it is not a dynamic array in the way
      other dynamic event arrays work, and this broke user space parsing.
      The update made the array appear to have 8 entries, and helper
      functions were added so it could still be parsed correctly, since
      the stack is dynamic and its length is determined by the size of
      the stored event.
      
      Although this fixed how user space reads the event, it also changed
      the internal structure used for the stack trace event: the array
      size went from [0] to [8], which increased the size of the stack
      trace event by 8 words. The size reserved on the ring buffer was
      the size of the stack trace event plus the number of stack entries
      found in the stack trace. That commit caused 8 more words to be
      reserved than needed, because it did not account for the caller
      field now having a size. This produced 8 entries of garbage (random
      data) at the end of the stack trace event:
      
                <idle>-0       [002] d... 1976396.837549: <stack trace>
       => trace_event_raw_event_sched_switch
       => __traceiter_sched_switch
       => __schedule
       => schedule_idle
       => do_idle
       => cpu_startup_entry
       => secondary_startup_64_no_verify
       => 0xc8c5e150ffff93de
       => 0xffff93de
       => 0
       => 0
       => 0xc8c5e17800000000
       => 0x1f30affff93de
       => 0x00000004
       => 0x200000000
      
      Instead, subtract the size of the caller field from the size of the event
      to make sure that only the amount needed to store the stack trace is
      reserved.
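
      A simplified sketch of the sizing logic (not the exact kernel code):

        struct stack_entry {
                int             size;
                unsigned long   caller[8];    /* was caller[0] before cbc3b92c */
        };

        static unsigned int stack_event_reserve_size(unsigned int nr_entries)
        {
                unsigned int stack_bytes = nr_entries * sizeof(unsigned long);

                /* Subtract the static caller[] field so its 8 words are not
                 * reserved on top of the entries actually captured.
                 */
                return sizeof(struct stack_entry)
                        - sizeof(((struct stack_entry *)0)->caller)
                        + stack_bytes;
        }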
      
      Link: https://lore.kernel.org/lkml/your-ad-here.call-01617191565-ext-9692@work.hours/
      
      
      
      Cc: stable@vger.kernel.org
      Fixes: cbc3b92c ("tracing: Set kernel_stack's caller size properly")
      Reported-by: Vasily Gorbik <gor@linux.ibm.com>
      Tested-by: Vasily Gorbik <gor@linux.ibm.com>
      Acked-by: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
  3. Mar 30, 2021
  4. Mar 27, 2021
  5. Mar 26, 2021
    • kernel: don't call do_exit() for PF_IO_WORKER threads · 10442994
      Jens Axboe authored
      
      Right now we're never calling get_signal() from PF_IO_WORKER threads, but
      in preparation for doing so, don't handle a fatal signal for them. The
      workers have state they need to clean up when exiting, so just return
      instead of calling do_exit() on their behalf. The threads themselves
      will detect a fatal signal and do a proper shutdown.
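
      Roughly, the fatal-signal path ends up doing the following
      (illustrative fragment, not the exact diff):

        /*
         * PF_IO_WORKER threads will catch and exit on fatal signals
         * themselves. They have cleanup that must be performed, so
         * we cannot call do_exit() on their behalf.
         */
        if (current->flags & PF_IO_WORKER)
                goto out;

        /* Regular threads exit here. */
        do_group_exit(ksig->info.si_signo);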
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  6. Mar 25, 2021
  7. Mar 23, 2021
    • PM: EM: postpone creating the debugfs dir till fs_initcall · fb9d62b2
      Lukasz Luba authored
      
      The debugfs directory '/sys/kernel/debug/energy_model' is needed before
      the Energy Model registration can happen. With a recent change in the
      debugfs subsystem it is no longer allowed to create this directory at
      an early stage (core_initcall), so creating it would fail.
      
      Postpone the creation of the EM debug dir to later stage: fs_initcall.
      
      This should be safe, since all clients (CPUFreq and devfreq drivers)
      are initialized at later stages.
      
      The custom debug log below shows the EM debug directory being created
      at fs_initcall time and the EMs being registered successfully at later
      stages.
      
      [    1.505717] energy_model: creating rootdir
      [    3.698307] cpu cpu0: EM: created perf domain
      [    3.709022] cpu cpu1: EM: created perf domain
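
      A minimal sketch of the change (assuming the usual debugfs helper and
      a root dentry variable; not the exact diff):

        static struct dentry *rootdir;

        static int __init em_debug_init(void)
        {
                /* Create /sys/kernel/debug/energy_model */
                rootdir = debugfs_create_dir("energy_model", NULL);

                return 0;
        }
        fs_initcall(em_debug_init);     /* was: core_initcall(em_debug_init); */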
      
      Fixes: 56348560 ("debugfs: do not attempt to create a new file before the filesystem is initalized")
      Reported-by: Ionela Voinescu <ionela.voinescu@arm.com>
      Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  8. Mar 21, 2021
  9. Mar 20, 2021
    • genirq: Disable interrupts for force threaded handlers · 81e2073c
      Thomas Gleixner authored
      
      With interrupt force threading all device interrupt handlers are invoked
      from kernel threads. Contrary to hard interrupt context, the invocation
      only disables bottom halves, but not interrupts. This was an oversight
      back then, because any code like the following has an issue:
      
      thread(irq_A)
        irq_handler(A)
          spin_lock(&foo->lock);
      
      interrupt(irq_B)
        irq_handler(B)
          spin_lock(&foo->lock);
      
      This has been triggered with networking (NAPI vs. hrtimers) and console
      drivers where printk() happens from an interrupt which interrupted the
      force threaded handler.
      
      Now people have noticed and started to change the spin_lock() in the
      handler to spin_lock_irqsave(), which hurts performance, or to add
      IRQF_NOTHREAD to the interrupt request, which in turn breaks RT.
      
      Fix the root cause rather than the symptom: disable interrupts before
      invoking the force-threaded handler. This preserves the regular
      semantics and the usefulness of interrupt force threading as a general
      debugging tool.
      
      For non-RT kernels this does not change much, except that during the
      execution of the threaded handler interrupts are delayed until the
      handler returns. With respect to scheduling and softirq processing
      there is no difference.
      
      For RT kernels there is no issue.
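
      A simplified sketch of what the fix amounts to in the force-threaded
      handler invocation (illustrative, not the exact diff):

        static irqreturn_t irq_forced_thread_fn(struct irq_desc *desc,
                                                struct irqaction *action)
        {
                irqreturn_t ret;

                local_bh_disable();
                if (!IS_ENABLED(CONFIG_PREEMPT_RT))
                        local_irq_disable();    /* new: match hardirq semantics */

                ret = action->thread_fn(action->irq, action->dev_id);

                if (!IS_ENABLED(CONFIG_PREEMPT_RT))
                        local_irq_enable();
                local_bh_enable();

                return ret;
        }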
      
      Fixes: 8d32a307 ("genirq: Provide forced interrupt threading")
      Reported-by: Johan Hovold <johan@kernel.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Johan Hovold <johan@kernel.org>
      Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Link: https://lore.kernel.org/r/20210317143859.513307808@linutronix.de
  10. Mar 19, 2021
  11. Mar 18, 2021
  12. Mar 17, 2021
    • bpf: Fix fexit trampoline. · e21aa341
      Alexei Starovoitov authored
      
      The fexit/fmod_ret programs can be attached to kernel functions that can
      sleep. The synchronize_rcu_tasks() will not wait for such tasks to
      complete. In such a case the trampoline image will be freed, and when the
      task wakes up the return IP will point to freed memory, causing a crash.
      Solve this by adding percpu_ref_get/put for the duration of the
      trampoline and by separating the lifetimes of the trampoline and its
      image.
      The "half page" optimization has to be removed, since the
      first_half->second_half->first_half transition cannot be guaranteed to
      complete in deterministic time. Every trampoline update becomes a new
      image. An image with fmod_ret or fexit progs will be freed via
      percpu_ref_kill and call_rcu_tasks; together they will wait for the
      original function and the trampoline asm to complete. The trampoline is
      patched from nop to jmp to skip fexit progs, which are freed
      independently from the trampoline. An image with only fentry progs will
      be freed via call_rcu_tasks_trace+call_rcu_tasks, which will wait for
      both sleepable and non-sleepable progs to complete.
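
      A hedged sketch of the lifetime scheme (illustrative only; the struct
      and helper names here are assumptions, not the actual bpf trampoline
      code):

        #include <linux/percpu-refcount.h>
        #include <linux/rcupdate.h>
        #include <linux/slab.h>

        struct tramp_image {
                struct percpu_ref refcnt;   /* held while trampoline asm runs */
                struct rcu_head   rcu;
                void             *ip;       /* executable trampoline code */
        };

        static void tramp_image_free_rcu(struct rcu_head *rcu)
        {
                struct tramp_image *im = container_of(rcu, struct tramp_image, rcu);

                /* By now no task sleeps inside the old trampoline. */
                percpu_ref_exit(&im->refcnt);
                kfree(im);
        }

        static void tramp_image_release(struct percpu_ref *ref)
        {
                struct tramp_image *im = container_of(ref, struct tramp_image, refcnt);

                /* Last reference dropped: wait for tasks-RCU before freeing. */
                call_rcu_tasks(&im->rcu, tramp_image_free_rcu);
        }

        /* On update, retire the old image instead of reusing its other half. */
        static void tramp_image_retire(struct tramp_image *im)
        {
                percpu_ref_kill(&im->refcnt);
        }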
      
      Fixes: fec56f58 ("bpf: Introduce BPF trampoline")
      Reported-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Paul E. McKenney <paulmck@kernel.org>  # for RCU
      Link: https://lore.kernel.org/bpf/20210316210007.38949-1-alexei.starovoitov@gmail.com
    • bpf: Add sanity check for upper ptr_limit · 1b1597e6
      Piotr Krysiuk authored
      
      Given that we know the maximum possible value of ptr_limit at the time
      it is retrieved, add basic assertions so that the verifier can bail out
      and reject the program if anything looks odd. Nothing has triggered this
      so far, but it also does not hurt to have these checks.
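
      A minimal sketch of the idea (assumed names, not the verifier code):

        /* max_limit is the largest value ptr_limit may legitimately take
         * for this pointer type.
         */
        if (ptr_limit >= max_limit)
                return -ERANGE;         /* looks odd: reject the program */

        *limit_out = ptr_limit;
        return 0;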
      
      Signed-off-by: Piotr Krysiuk <piotras@gmail.com>
      Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Simplify alu_limit masking for pointer arithmetic · b5871dca
      Piotr Krysiuk authored
      
      Instead of having the mov32 with an aux->alu_limit - 1 immediate, move
      this operation into retrieve_ptr_limit() to simplify the logic and to
      allow for subsequent sanity boundary checks inside retrieve_ptr_limit().
      This avoids a future situation where, at the time of the verifier
      masking rewrite, we would run into an underflow that does not sign
      extend due to the nature of the mov32 instruction.
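
      An illustration of the mov32 pitfall being avoided (plain C, not
      verifier code):

        u64 reg_ax;
        u32 alu_limit = 0;

        /* A 32-bit move zero-extends into the 64-bit register, so an
         * immediate of alu_limit - 1 with alu_limit == 0 loads
         * 0x00000000ffffffff rather than a sign-extended -1
         * (0xffffffffffffffff).
         */
        reg_ax = (u32)(alu_limit - 1);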
      
      Signed-off-by: Piotr Krysiuk <piotras@gmail.com>
      Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Fix off-by-one for area size in creating mask to left · 10d2bb2e
      Piotr Krysiuk authored
      
      retrieve_ptr_limit() computes the ptr_limit for registers with stack and
      map_value type. ptr_limit is the size of the memory area that is still
      valid / in-bounds from the point of the current position and direction
      of the operation (add / sub). This size will later be used for masking
      the operation such that attempting out-of-bounds access in the speculative
      domain is redirected to remain within the bounds of the current map value.
      
      When masking to the right the size is correct; however, when masking to
      the left, the size is off by one, which would lead to an incorrect mask
      and thus an incorrect arithmetic operation in the non-speculative
      domain. Piotr found that if the resulting alu_limit value is zero, then
      the BPF_MOV32_IMM() from the fixup_bpf_calls() rewrite will end up
      loading 0xffffffff into AX instead of sign-extending to the full 64-bit
      range. As a result, this allows abuse for executing speculatively
      out-of-bounds loads against a 4GB window of address space and thus
      extracting the contents of kernel memory via a side channel.
      
      Fixes: 979d63d5 ("bpf: prevent out of bounds speculation on pointer arithmetic")
      Signed-off-by: Piotr Krysiuk <piotras@gmail.com>
      Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Prohibit alu ops for pointer types not defining ptr_limit · f232326f
      Piotr Krysiuk authored
      
      The purpose of this patch is to streamline error propagation, and in
      particular to propagate retrieve_ptr_limit() errors for pointer types
      that do not define a ptr_limit, so that register-based alu ops against
      these types can be rejected.
      
      The main rationale is that a gap has been identified by Piotr in the existing
      protection against speculatively out-of-bounds loads, for example, in case of
      ctx pointers, unprivileged programs can still perform pointer arithmetic. This
      can be abused to execute speculatively out-of-bounds loads without restrictions
      and thus extract contents of kernel memory.
      
      Fix this by rejecting unprivileged programs that attempt any pointer arithmetic
      on unprotected pointer types. The two affected ones are pointer to ctx as well
      as pointer to map. Field access through a modified ctx pointer is
      rejected at a later point in the verifier, and 7c696732 ("bpf: Permit
      map_ptr arithmetic with opcode add and offset 0") is only relevant for
      root-only use cases. The risk of breaking unprivileged programs is
      considered very low.
      
      Fixes: 7c696732 ("bpf: Permit map_ptr arithmetic with opcode add and offset 0")
      Fixes: b2157399 ("bpf: prevent out-of-bounds speculation")
      Signed-off-by: Piotr Krysiuk <piotras@gmail.com>
      Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
    • locking/ww_mutex: Simplify use_ww_ctx & ww_ctx handling · 5de2055d
      Waiman Long authored
      
      The use_ww_ctx flag is passed to mutex_optimistic_spin(), but the
      function doesn't use it. The frequent use of the (use_ww_ctx && ww_ctx)
      combination is repetitive.
      
      In fact, ww_ctx should not be used at all if !use_ww_ctx. Simplify the
      ww_mutex code by dropping use_ww_ctx from mutex_optimistic_spin() and
      clearing ww_ctx if !use_ww_ctx. In this way, we can replace (use_ww_ctx
      && ww_ctx) with just (ww_ctx).
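
      A sketch of the simplification (lock_common() below is a hypothetical
      stand-in for the real lock path):

        static int lock_common(struct mutex *lock, bool use_ww_ctx,
                               struct ww_acquire_ctx *ww_ctx)
        {
                if (!use_ww_ctx)
                        ww_ctx = NULL;  /* must not be used in that case */

                if (ww_ctx) {           /* was: if (use_ww_ctx && ww_ctx) */
                        /* ww_mutex-specific handling */
                }

                return 0;
        }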
      
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Davidlohr Bueso <dbueso@suse.de>
      Link: https://lore.kernel.org/r/20210316153119.13802-2-longman@redhat.com
  13. Mar 16, 2021
  14. Mar 14, 2021
  15. Mar 13, 2021
  16. Mar 10, 2021
  17. Mar 08, 2021
  18. Mar 06, 2021