  1. Apr 08, 2021
    • stack: Optionally randomize kernel stack offset each syscall · 39218ff4
      Kees Cook authored
      This provides the ability for architectures to enable kernel stack base
      address offset randomization. This feature is controlled by the boot
      param "randomize_kstack_offset=on/off", with its default value set by
      CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT.
      
      This feature is based on the original idea from the last public release
      of PaX's RANDKSTACK feature: https://pax.grsecurity.net/docs/randkstack.txt
      All the credit for the original idea goes to the PaX team. Note that
      the design and implementation of this upstream randomize_kstack_offset
      feature differs greatly from the RANDKSTACK feature (see below).
      
      Reasoning for the feature:
      
      This feature aims to make the various stack-based attacks that rely on
      a deterministic stack structure harder to carry out. We have had many
      such attacks in the past (to name just a few):
      
      https://jon.oberheide.org/files/infiltrate12-thestackisback.pdf
      https://jon.oberheide.org/files/stackjacking-infiltrate11.pdf
      https://googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux-kernel_20.html
      
      As Linux kernel stack protections have been constantly improving
      (vmap-based stack allocation with guard pages, removal of thread_info,
      STACKLEAK), attackers have had to find new ways for their exploits
      to work. They have done so, continuing to rely on the kernel's stack
      determinism, in situations where VMAP_STACK and THREAD_INFO_IN_TASK_STRUCT
      were not relevant. For example, the following recent attacks would have
      been hampered if the stack offset was non-deterministic between syscalls:
      
      https://repositorio-aberto.up.pt/bitstream/10216/125357/2/374717.pdf
      (page 70: targeting the pt_regs copy with linear stack overflow)
      
      https://a13xp0p0v.github.io/2020/02/15/CVE-2019-18683.html
      (leaked stack address from one syscall as a target during next syscall)
      
      The main idea is that since the stack offset is randomized on each system
      call, it is harder for an attack to reliably land in any particular place
      on the thread stack, even with address exposures, as the stack base will
      change on the next syscall. Also, since randomization is performed after
      placing pt_regs, the ptrace-based approach[1] to discover the randomized
      offset during a long-running syscall should not be possible.
      
      Design description:
      
      During most of the kernel's execution, it runs on the "thread stack",
      which is pretty deterministic in its structure: it is fixed in size,
      and on every entry from userspace to kernel on a syscall the thread
      stack starts construction from an address fetched from the per-cpu
      cpu_current_top_of_stack variable. The first element to be pushed to the
      thread stack is the pt_regs struct that stores all required CPU registers
      and syscall parameters. Finally the specific syscall function is called,
      with the stack being used as the kernel executes the resulting request.
      
      The goal of the randomize_kstack_offset feature is to add a random offset
      after pt_regs has been pushed to the stack and before the rest of the
      thread stack is used during syscall processing, and to change it every
      time a process issues a syscall. The source of randomness is currently
      architecture-defined (x86 uses the low byte of rdtsc()). Future
      improvements for different entropy sources are possible, but out of scope
      for this patch. Furthermore, to add more unpredictability, new offsets
      are chosen at the end of syscalls (the timing of which should be less
      easy to measure from userspace than at syscall entry time), and stored
      in a per-CPU variable, so that the life of the value does not stay
      explicitly tied to a single task.
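
      As a rough sketch of that part of the design (the helper name and the
      exact way new entropy is mixed in are illustrative assumptions here;
      the per-CPU kstack_offset variable itself is visible in the
      disassembly below):

        /*
         * Sketch: per-CPU storage for the next syscall's stack offset,
         * refreshed near the end of syscall processing so its lifetime
         * is not tied to a single task.
         */
        #include <linux/percpu.h>
        #include <linux/jump_label.h>

        DECLARE_STATIC_KEY_FALSE(randomize_kstack_offset); /* boot-time switch, see below */
        DEFINE_PER_CPU(u32, kstack_offset);

        static inline void choose_random_kstack_offset(u32 rand)
        {
                if (static_branch_unlikely(&randomize_kstack_offset)) {
                        u32 offset = raw_cpu_read(kstack_offset);

                        offset ^= rand; /* mix in arch entropy, e.g. low rdtsc bits */
                        raw_cpu_write(kstack_offset, offset);
                }
        }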
      
      As suggested by Andy Lutomirski, the offset is added using alloca()
      and an empty asm() statement with an output constraint, since this avoids
      changes to the assembly syscall entry code and to the unwinder, and
      provides correct stack alignment as defined by the compiler.
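
      In code form, the trick amounts to roughly the following (a sketch
      only; the in-tree add_random_kstack() macro mentioned further below,
      and its mask width, may differ in detail):

        /* Sketch: consume a random, capped slice of stack before the
         * syscall body runs. The empty asm() with an output constraint
         * keeps the allocation from being optimized away, without touching
         * entry assembly or the unwinder.
         */
        #include <linux/percpu.h>
        #include <linux/jump_label.h>

        DECLARE_STATIC_KEY_FALSE(randomize_kstack_offset);
        DECLARE_PER_CPU(u32, kstack_offset);

        #define KSTACK_OFFSET_MAX(x)    ((x) & 0x3FF)  /* keep it well below a page */

        #define add_random_kstack() do {                                        \
                if (static_branch_unlikely(&randomize_kstack_offset)) {         \
                        u32 offset = raw_cpu_read(kstack_offset);               \
                        u8 *ptr = __builtin_alloca(KSTACK_OFFSET_MAX(offset));  \
                        asm volatile("" : "=m"(*ptr));                          \
                }                                                               \
        } while (0)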
      
      In order to make this available by default with zero performance impact
      for those that don't want it, it is boot-time selectable with static
      branches. This way, if the overhead is not wanted, it can just be
      left turned off with no performance impact.
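
      A minimal sketch of that boot-time gating, using the plain static-key
      API (the real key definition could instead take its default from
      CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT via the jump_label change
      further down this series):

        #include <linux/jump_label.h>
        #include <linux/kernel.h>       /* kstrtobool() */
        #include <linux/init.h>

        DEFINE_STATIC_KEY_FALSE(randomize_kstack_offset);

        static int __init early_randomize_kstack_offset(char *buf)
        {
                bool enable;
                int ret = kstrtobool(buf, &enable);

                if (ret)
                        return ret;

                if (enable)
                        static_branch_enable(&randomize_kstack_offset);
                else
                        static_branch_disable(&randomize_kstack_offset);
                return 0;
        }
        early_param("randomize_kstack_offset", early_randomize_kstack_offset);

      With this, "randomize_kstack_offset=on" flips the key once at boot and
      the checks above compile down to a patched jump, so the off case costs
      nothing at runtime.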
      
      The generated assembly for x86_64 with GCC looks like this:
      
      ...
      ffffffff81003977: 65 8b 05 02 ea 00 7f  mov %gs:0x7f00ea02(%rip),%eax
      					    # 12380 <kstack_offset>
      ffffffff8100397e: 25 ff 03 00 00        and $0x3ff,%eax
      ffffffff81003983: 48 83 c0 0f           add $0xf,%rax
      ffffffff81003987: 25 f8 07 00 00        and $0x7f8,%eax
      ffffffff8100398c: 48 29 c4              sub %rax,%rsp
      ffffffff8100398f: 48 8d 44 24 0f        lea 0xf(%rsp),%rax
      ffffffff81003994: 48 83 e0 f0           and $0xfffffffffffffff0,%rax
      ...
      
      As a result of the above stack alignment, this patch introduces about
      5 bits of randomness after pt_regs is spilled to the thread stack on
      x86_64, and 6 bits on x86_32 (since it has 1 fewer bit required for
      stack alignment). The amount of entropy could be adjusted based on how
      much of the stack space we wish to trade for security.
      
      My measure of syscall performance overhead (on x86_64):
      
      lmbench: /usr/lib/lmbench/bin/x86_64-linux-gnu/lat_syscall -N 10000 null
          randomize_kstack_offset=y	Simple syscall: 0.7082 microseconds
          randomize_kstack_offset=n	Simple syscall: 0.7016 microseconds
      
      So, roughly 0.9% overhead growth for a no-op syscall, which is very
      manageable. And for people that don't want this, it's off by default.
      
      There are two gotchas with using the alloca() trick. First,
      compilers that have Stack Clash protection (-fstack-clash-protection)
      enabled by default (e.g. Ubuntu[3]) add pagesize stack probes to
      any dynamic stack allocations. While the randomization offset is
      always less than a page, the resulting assembly would still contain
      (unreachable!) probing routines, bloating the generated code. To
      avoid this, -fno-stack-clash-protection is unconditionally added to
      the kernel Makefile since this is the only dynamic stack allocation in
      the kernel (now that VLAs have been removed) and it is provably safe
      from Stack Clash style attacks.
      
      The second gotcha with alloca() is a negative interaction with
      -fstack-protector*, in that it sees the alloca() as an array allocation,
      which triggers the unconditional addition of the stack canary function
      pre/post-amble which slows down syscalls regardless of the static
      branch. In order to avoid adding this unneeded check and its associated
      performance impact, architectures need to carefully remove uses of
      -fstack-protector-strong (or -fstack-protector) in the compilation units
      that use the add_random_kstack() macro and to audit the resulting stack
      mitigation coverage (to make sure no desired coverage disappears). No
      change is visible for this on x86 because the stack protector is already
      unconditionally disabled for the compilation unit, but the change is
      required on arm64. There is, unfortunately, no attribute that can be
      used to disable stack protector for specific functions.
      
      Comparison to PaX RANDKSTACK feature:
      
      The RANDKSTACK feature randomizes the location of the stack start
      (cpu_current_top_of_stack), i.e. including the location of pt_regs
      structure itself on the stack. Initially this patch followed the same
      approach, but during the recent discussions[2], it has been determined
      to be of little value since, if ptrace functionality is available to
      an attacker, they can use PTRACE_PEEKUSR/PTRACE_POKEUSR to read/write
      different offsets in the pt_regs struct, observe the cache behavior of
      the pt_regs accesses, and figure out the random stack offset. Another
      difference is that the random offset is stored in a per-cpu variable,
      rather than being per-thread. As a result, the two implementations
      differ a fair bit in their details and results, though
      obviously the intent is similar.
      
      [1] https://lore.kernel.org/kernel-hardening/2236FBA76BA1254E88B949DDB74E612BA4BC57C1@IRSMSX102.ger.corp.intel.com/
      [2] https://lore.kernel.org/kernel-hardening/20190329081358.30497-1-elena.reshetova@intel.com/
      [3] https://lists.ubuntu.com/archives/ubuntu-devel/2019-June/040741.html
      
      
      
      Co-developed-by: Elena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lore.kernel.org/r/20210401232347.2791257-4-keescook@chromium.org
    • init_on_alloc: Optimize static branches · 51cba1eb
      Kees Cook authored
      
      The state of CONFIG_INIT_ON_ALLOC_DEFAULT_ON (and ...ON_FREE...) did not
      change the assembly ordering of the static branches: they were always out
      of line. Use the new jump_label macros to check the CONFIG settings to
      default to the "expected" state, which slightly optimizes the resulting
      assembly code.
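
      For illustration, the allocation-time check ends up shaped roughly
      like this (a sketch that assumes the _MAYBE declaration and
      static_branch_maybe() helpers introduced by the jump_label patch
      below):

        #include <linux/jump_label.h>
        #include <linux/gfp.h>

        /* Key whose branch layout now follows the CONFIG default. */
        DECLARE_STATIC_KEY_MAYBE(CONFIG_INIT_ON_ALLOC_DEFAULT_ON, init_on_alloc);

        static inline bool want_init_on_alloc(gfp_t flags)
        {
                if (static_branch_maybe(CONFIG_INIT_ON_ALLOC_DEFAULT_ON,
                                        &init_on_alloc))
                        return true;
                return flags & __GFP_ZERO;
        }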
      
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Alexander Potapenko <glider@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Link: https://lore.kernel.org/r/20210401232347.2791257-3-keescook@chromium.org
    • jump_label: Provide CONFIG-driven build state defaults · 0d66ccc1
      Kees Cook authored
      
      As shown in the comment in jump_label.h, choosing the initial state of
      static branches changes the assembly layout. If the condition is expected
      to be likely it's inline, and if unlikely it is out of line via a jump.
      
      A few places in the kernel use (or could be using) a CONFIG to choose the
      default state, which would give a small performance benefit to their
      compile-time declared default. Provide the infrastructure to do this.
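
      One plausible shape for such a helper (a sketch of the idea, not
      necessarily the exact set of macros added):

        #include <linux/kconfig.h>      /* IS_ENABLED() */
        #include <linux/jump_label.h>

        /*
         * Pick the inline (likely) or out-of-line (unlikely) static branch
         * form at compile time, based on a CONFIG default.
         */
        #define static_branch_maybe(config, x)                          \
                (IS_ENABLED(config) ? static_branch_likely(x)           \
                                    : static_branch_unlikely(x))

      A caller can then write static_branch_maybe(CONFIG_FOO_DEFAULT_ON,
      &foo_key) (placeholder names) and get the inline layout whenever the
      CONFIG default says the branch is expected to be taken.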
      
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20210401232347.2791257-2-keescook@chromium.org
  2. Apr 04, 2021
    • Linux 5.12-rc6 · e49d033b
      Linus Torvalds authored
      v5.12-rc6
    • firewire: nosy: Fix a use-after-free bug in nosy_ioctl() · 829933ef
      Zheyu Ma authored
      For each device, the nosy driver allocates a pcilynx structure.
      A use-after-free might happen in the following scenario:
      
       1. Open the nosy device for the first time and call ioctl with command
          NOSY_IOC_START; a new client A is allocated and added to the
          doubly linked list.
       2. Open the nosy device for the second time and call ioctl with command
          NOSY_IOC_START; a new client B is allocated and added to the
          doubly linked list.
       3. Call ioctl with command NOSY_IOC_START again for client A; client A
          is re-added to the doubly linked list, corrupting the list.
       4. Close the first nosy device; nosy_release is called, which unlinks
          and frees client A.
       5. Close the second nosy device; the freed client A is still referenced
          via the corrupted list, resulting in a use-after-free (UAF).
      
      The root cause of this bug is that an element of the doubly linked list
      is inserted into the list a second time.
      
      Fix this bug by adding a check before inserting a client.  If a client
      is already in the linked list, don't insert it.
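
      A sketch of that check, in the spirit of the list_empty()-under-lock
      note below (the wrapper function and the pcilynx field names are
      illustrative, not copied from the driver):

        #include <linux/list.h>
        #include <linux/spinlock.h>

        /* Illustrative only: mirrors the NOSY_IOC_START path. Assumes
         * client->link was set up with INIT_LIST_HEAD() when the client
         * was created, so list_empty() means "not currently linked".
         */
        static void nosy_start_snoop(struct pcilynx *lynx, struct client *client)
        {
                unsigned long flags;

                spin_lock_irqsave(&lynx->client_list_lock, flags);
                if (list_empty(&client->link))
                        list_add_tail(&client->link, &lynx->client_list);
                spin_unlock_irqrestore(&lynx->client_list_lock, flags);
        }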
      
      The following KASAN report reveals it:
      
         BUG: KASAN: use-after-free in nosy_release+0x1ea/0x210
         Write of size 8 at addr ffff888102ad7360 by task poc
         CPU: 3 PID: 337 Comm: poc Not tainted 5.12.0-rc5+ #6
         Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
         Call Trace:
           nosy_release+0x1ea/0x210
           __fput+0x1e2/0x840
           task_work_run+0xe8/0x180
           exit_to_user_mode_prepare+0x114/0x120
           syscall_exit_to_user_mode+0x1d/0x40
           entry_SYSCALL_64_after_hwframe+0x44/0xae
      
         Allocated by task 337:
           nosy_open+0x154/0x4d0
           misc_open+0x2ec/0x410
           chrdev_open+0x20d/0x5a0
           do_dentry_open+0x40f/0xe80
           path_openat+0x1cf9/0x37b0
           do_filp_open+0x16d/0x390
           do_sys_openat2+0x11d/0x360
           __x64_sys_open+0xfd/0x1a0
           do_syscall_64+0x33/0x40
           entry_SYSCALL_64_after_hwframe+0x44/0xae
      
         Freed by task 337:
           kfree+0x8f/0x210
           nosy_release+0x158/0x210
           __fput+0x1e2/0x840
           task_work_run+0xe8/0x180
           exit_to_user_mode_prepare+0x114/0x120
           syscall_exit_to_user_mode+0x1d/0x40
           entry_SYSCALL_64_after_hwframe+0x44/0xae
      
         The buggy address belongs to the object at ffff888102ad7300 which belongs to the cache kmalloc-128 of size 128
         The buggy address is located 96 bytes inside of 128-byte region [ffff888102ad7300, ffff888102ad7380)
      
      [ Modified to use 'list_empty()' inside proper lock  - Linus ]
      
      Link: https://lore.kernel.org/lkml/1617433116-5930-1-git-send-email-zheyuma97@gmail.com/
      
      
      Reported-and-tested-by: 马哲宇 (Zheyu Ma) <zheyuma97@gmail.com>
      Signed-off-by: Zheyu Ma <zheyuma97@gmail.com>
      Cc: Greg Kroah-Hartman <greg@kroah.com>
      Cc: Stefan Richter <stefanr@s5r6.in-berlin.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. Apr 03, 2021
  4. Apr 02, 2021
  5. Apr 01, 2021
    • kbuild: lto: Merge module sections if and only if CONFIG_LTO_CLANG is enabled · 6a3193cd
      Sean Christopherson authored
      
      Merge module sections only when using Clang LTO. With ld.bfd, merging
      sections does not appear to update the symbol tables for the module,
      e.g. 'readelf -s' shows the value that a symbol would have had if
      sections were not merged. ld.lld does not show this problem.
      
      The stale symbol table breaks gdb's function disassembler, and presumably
      other things, e.g.
      
        gdb -batch -ex "file arch/x86/kvm/kvm.ko" -ex "disassemble kvm_init"
      
      reads the wrong bytes and dumps garbage.
      
      Fixes: dd277622 ("kbuild: lto: merge module sections")
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
      Tested-by: Sami Tolvanen <samitolvanen@google.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20210322234438.502582-1-seanjc@google.com
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 6905b1dc
      Linus Torvalds authored
      Pull kvm fixes from Paolo Bonzini:
       "It's a bit larger than I (and probably you) would like by the time we
        get to -rc6, but perhaps not entirely unexpected since the changes in
        the last merge window were larger than usual.
      
        x86:
         - Fixes for missing TLB flushes with TDP MMU
      
         - Fixes for race conditions in nested SVM
      
         - Fixes for lockdep splat with Xen emulation
      
         - Fix for kvmclock underflow
      
         - Fix srcdir != builddir builds
      
         - Other small cleanups
      
        ARM:
         - Fix GICv3 MMIO compatibility probing
      
         - Prevent guests from using the ARMv8.4 self-hosted tracing
           extension"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
        selftests: kvm: Check that TSC page value is small after KVM_SET_CLOCK(0)
        KVM: x86: Prevent 'hv_clock->system_time' from going negative in kvm_guest_time_update()
        KVM: x86: disable interrupts while pvclock_gtod_sync_lock is taken
        KVM: x86: reduce pvclock_gtod_sync_lock critical sections
        KVM: SVM: ensure that EFER.SVME is set when running nested guest or on nested vmexit
        KVM: SVM: load control fields from VMCB12 before checking them
        KVM: x86/mmu: Don't allow TDP MMU to yield when recovering NX pages
        KVM: x86/mmu: Ensure TLBs are flushed for TDP MMU during NX zapping
        KVM: x86/mmu: Ensure TLBs are flushed when yielding during GFN range zap
        KVM: make: Fix out-of-source module builds
        selftests: kvm: make hardware_disable_test less verbose
        KVM: x86/vPMU: Forbid writing to MSR_F15H_PERF MSRs when guest doesn't have X86_FEATURE_PERFCTR_CORE
        KVM: x86: remove unused declaration of kvm_write_tsc()
        KVM: clean up the unused argument
        tools/kvm_stat: Add restart delay
        KVM: arm64: Fix CPU interface MMIO compatibility detection
        KVM: arm64: Disable guest access to trace filter controls
        KVM: arm64: Hide system instruction access to Trace registers
    • Merge tag 'drm-fixes-2021-04-02' of git://anongit.freedesktop.org/drm/drm · a80314c3
      Linus Torvalds authored
      Pull drm fixes from Dave Airlie:
       "Things have settled down in time for Easter, a random smattering of
        small fixes across a few drivers.
      
        I'm guessing though there might be some i915 and misc fixes out there
        I haven't gotten yet, but since today is a public holiday here, I'm
        sending this early so I can have the day off, I'll see if more
        requests come in and decide what to do with them later.
      
        amdgpu:
         - Polaris idle power fix
         - VM fix
         - Vangogh S3 fix
         - Fixes for non-4K page sizes
      
        amdkfd:
         - dqm fence memory corruption fix
      
        tegra:
         - lockdep warning fix
         - runtime PM reference fix
         - display controller fix
         - PLL Fix
      
        imx:
         - memory leak in error path fix
         - LDB driver channel registration fix
         - oob array warning in LDB driver
      
        exynos:
         - unused header file removal"
      
      * tag 'drm-fixes-2021-04-02' of git://anongit.freedesktop.org/drm/drm:
        drm/amdgpu: check alignment on CPU page for bo map
        drm/amdgpu: Set a suitable dev_info.gart_page_size
        drm/amdgpu/vangogh: don't check for dpm in is_dpm_running when in suspend
        drm/amdkfd: dqm fence memory corruption
        drm/tegra: sor: Grab runtime PM reference across reset
        drm/tegra: dc: Restore coupling of display controllers
        gpu: host1x: Use different lock classes for each client
        drm/tegra: dc: Don't set PLL clock to 0Hz
        drm/amdgpu: fix offset calculation in amdgpu_vm_bo_clear_mappings()
        drm/amd/pm: no need to force MCLK to highest when no display connected
        drm/exynos/decon5433: Remove the unused include statements
        drm/imx: imx-ldb: fix out of bounds array access warning
        drm/imx: imx-ldb: Register LDB channel1 when it is the only channel to be used
        drm/imx: fix memory leak when fails to init
    • Merge tag 'imx-drm-fixes-2021-04-01' of git://git.pengutronix.de/git/pza/linux into drm-fixes · 6fdb8e5a
      Dave Airlie authored
      
      drm/imx: imx-drm-core and imx-ldb fixes
      
      Fix a memory leak in an error path during DRM device initialization,
      fix the LDB driver to register channel 1 even if channel 0 is unused,
      and fix an out of bounds array access warning in the LDB driver.
      
      Signed-off-by: Dave Airlie <airlied@redhat.com>
      
      From: Philipp Zabel <p.zabel@pengutronix.de>
      Link: https://patchwork.freedesktop.org/patch/msgid/20210401092235.GA13586@pengutronix.de