1. 23 Sep, 2017 1 commit
    • x86/asm: Fix inline asm call constraints for Clang · f5caf621
      Josh Poimboeuf authored
      For inline asm statements which have a CALL instruction, we list the
      stack pointer as a constraint to convince GCC to ensure the frame
      pointer is set up first:
      
        static inline void foo()
        {
      	register void *__sp asm(_ASM_SP);
      	asm("call bar" : "+r" (__sp));
        }
      
      Unfortunately, that pattern causes Clang to corrupt the stack pointer.
      
      The fix is easy: convert the stack pointer register variable to a global
      variable.
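A rough userspace sketch of the converted pattern, not the kernel patch itself: the stack pointer is bound to a file-scope register variable and listed as an in/out operand. The names `current_stack_pointer`, `ASM_CALL_CONSTRAINT` and `read_sp_after_asm()` are chosen here for illustration, the empty asm body stands in for a real `call`, and the non-x86-64 branch is only a fallback so the sketch builds elsewhere.

```c
#include <stdint.h>

#if defined(__x86_64__)
/* File-scope register variable bound to the x86-64 stack pointer. */
register unsigned long current_stack_pointer __asm__("rsp");
/* In/out operand that makes an asm statement depend on the stack pointer. */
#define ASM_CALL_CONSTRAINT "+r" (current_stack_pointer)
#endif

/* Hypothetical helper: issue an asm statement carrying the constraint,
 * then report the stack pointer value that was observed. */
static inline uintptr_t read_sp_after_asm(void)
{
#if defined(__x86_64__)
	/* A real user would place "call some_function" in the template. */
	__asm__ __volatile__("" : ASM_CALL_CONSTRAINT);
	return (uintptr_t)current_stack_pointer;
#else
	/* Fallback for other architectures: no register variable is used. */
	return (uintptr_t)__builtin_frame_address(0);
#endif
}
```

Because the variable lives at file scope, every asm statement can share the same operand instead of declaring a local `__sp` in each function.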
      
      It should be noted that the end result is different based on the GCC
      version.  With GCC 6.4, this patch has exactly the same result as
      before:
      
      	defconfig	defconfig-nofp	distro		distro-nofp
       before	9820389		9491555		8816046		8516940
       after	9820389		9491555		8816046		8516940
      
      With GCC 7.2, however, GCC's behavior has changed.  It now changes its
      behavior based on the conversion of the register variable to a global.
      That somehow convinces it to *always* set up the frame pointer before
      inserting *any* inline asm.  (Therefore, listing the variable as an
      output constraint is a no-op and is no longer necessary.)  It's a bit
      overkill, but the performance impact should be negligible.  And in fact,
      there's a nice improvement with frame pointers disabled:
      
      	defconfig	defconfig-nofp	distro		distro-nofp
       before	9796316		9468236		9076191		8790305
       after	9796957		9464267		9076381		8785949
      
      So in summary, while listing the stack pointer as an output constraint
      is no longer necessary for newer versions of GCC, it's still needed for
      older versions.
      
      Suggested-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Reported-by: Matthias Kaehlcke <mka@chromium.org>
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Dmitriy Vyukov <dvyukov@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Miguel Bernal Marin <miguel.bernal.marin@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/3db862e970c432ae823cf515c52b54fec8270e0e.1505942196.git.jpoimboe@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  2. 31 Aug, 2017 1 commit
  3. 29 Aug, 2017 2 commits
  4. 13 Jun, 2017 1 commit
  5. 14 Mar, 2017 2 commits
  6. 02 Mar, 2017 1 commit
  7. 15 Dec, 2016 1 commit
  8. 25 Oct, 2016 1 commit
    • x86/dumpstack: Remove kernel text addresses from stack dump · bb5e5ce5
      Josh Poimboeuf authored
      Printing kernel text addresses in stack dumps is of questionable value,
      especially now that address randomization is becoming common.
      
      It can be a security issue because it leaks kernel addresses.  It also
      affects the usefulness of the stack dump.  Linus says:
      
        "I actually spend time cleaning up commit messages in logs, because
        useless data that isn't actually information (random hex numbers) is
        actively detrimental.
      
        It makes commit logs less legible.
      
        It also makes it harder to parse dumps.
      
        It's not useful. That makes it actively bad.
      
        I probably look at more oops reports than most people. I have not
        found the hex numbers useful for the last five years, because they are
        just randomized crap.
      
        The stack content thing just makes code scroll off the screen etc, for
        example."
      
      The only real downside to removing these addresses is that they can be
      used to disambiguate duplicate symbol names.  However such cases are
      rare, and the context of the stack dump should be enough to be able to
      figure it out.
      
      There's now a 'faddr2line' script which can be used to convert a
      function address to a file name and line:
      
        $ ./scripts/faddr2line ~/k/vmlinux write_sysrq_trigger+0x51/0x60
        write_sysrq_trigger+0x51/0x60:
        write_sysrq_trigger at drivers/tty/sysrq.c:1098
      
      Or gdb can be used:
      
        $ echo "list *write_sysrq_trigger+0x51" |gdb ~/k/vmlinux |grep "is in"
        (gdb) 0xffffffff815b5d83 is in driver_probe_device (/home/jpoimboe/git/linux/drivers/base/dd.c:378).
      
      (But note that when there are duplicate symbol names, gdb will only show
      the first symbol it finds.  faddr2line is recommended over gdb because
      it handles duplicates and it also does function size checking.)
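To make the `symbol+0xoffset/0xsize` notation concrete, here is a small C sketch that splits such an entry into its parts and applies a basic range check. It is not how faddr2line works internally (it does not consult vmlinux at all); `struct func_addr` and `parse_func_addr()` are hypothetical names used only for this illustration.

```c
#include <stdio.h>

/* Parsed form of a "symbol+0xoffset/0xsize" stack dump entry. */
struct func_addr {
	char name[64];
	unsigned long offset;	/* offset of the address within the function */
	unsigned long size;	/* total size of the function */
};

/*
 * Hypothetical helper: split an entry such as
 * "write_sysrq_trigger+0x51/0x60" into name, offset and size.
 * Returns 0 on success, -1 if the entry is malformed or the offset
 * does not fall inside the function.
 */
static int parse_func_addr(const char *entry, struct func_addr *out)
{
	/* %lx accepts an optional "0x" prefix, as strtoul() does in base 16. */
	if (sscanf(entry, "%63[^+]+%lx/%lx",
		   out->name, &out->offset, &out->size) != 3)
		return -1;

	if (out->offset >= out->size)
		return -1;

	return 0;
}
```

The size field is what lets a tool tell apart two functions that share a name but differ in length.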
      
      Here's an example of what a stack dump looks like after this change:
      
        BUG: unable to handle kernel NULL pointer dereference at           (null)
        IP: sysrq_handle_crash+0x45/0x80
        PGD 36bfa067 [   29.650644] PUD 7aca3067
        Oops: 0002 [#1] PREEMPT SMP
        Modules linked in: ...
        CPU: 1 PID: 786 Comm: bash Tainted: G            E   4.9.0-rc1+ #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014
        task: ffff880078582a40 task.stack: ffffc90000ba8000
        RIP: 0010:sysrq_handle_crash+0x45/0x80
        RSP: 0018:ffffc90000babdc8 EFLAGS: 00010296
        RAX: ffff880078582a40 RBX: 0000000000000063 RCX: 0000000000000001
        RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000292
        RBP: ffffc90000babdc8 R08: 0000000b31866061 R09: 0000000000000000
        R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
        R13: 0000000000000007 R14: ffffffff81ee8680 R15: 0000000000000000
        FS:  00007ffb43869700(0000) GS:ffff88007d400000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000000 CR3: 000000007a3e9000 CR4: 00000000001406e0
        Stack:
         ffffc90000babe00 ffffffff81572d08 ffffffff81572bd5 0000000000000002
         0000000000000000 ffff880079606600 00007ffb4386e000 ffffc90000babe20
         ffffffff81573201 ffff880036a3fd00 fffffffffffffffb ffffc90000babe40
        Call Trace:
         __handle_sysrq+0x138/0x220
         ? __handle_sysrq+0x5/0x220
         write_sysrq_trigger+0x51/0x60
         proc_reg_write+0x42/0x70
         __vfs_write+0x37/0x140
         ? preempt_count_sub+0xa1/0x100
         ? __sb_start_write+0xf5/0x210
         ? vfs_write+0x183/0x1a0
         vfs_write+0xb8/0x1a0
         SyS_write+0x58/0xc0
         entry_SYSCALL_64_fastpath+0x1f/0xc2
        RIP: 0033:0x7ffb42f55940
        RSP: 002b:00007ffd33bb6b18 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
        RAX: ffffffffffffffda RBX: 0000000000000046 RCX: 00007ffb42f55940
        RDX: 0000000000000002 RSI: 00007ffb4386e000 RDI: 0000000000000001
        RBP: 0000000000000011 R08: 00007ffb4321ea40 R09: 00007ffb43869700
        R10: 00007ffb43869700 R11: 0000000000000246 R12: 0000000000778a10
        R13: 00007ffd33bb5c00 R14: 0000000000000007 R15: 0000000000000010
        Code: 34 e8 d0 34 bc ff 48 c7 c2 3b 2b 57 81 be 01 00 00 00 48 c7 c7 e0 dd e5 81 e8 a8 55 ba ff c7 05 0e 3f de 00 01 00 00 00 0f ae f8 <c6> 04 25 00 00 00 00 01 5d c3 e8 4c 49 bc ff 84 c0 75 c3 48 c7
        RIP: sysrq_handle_crash+0x45/0x80 RSP: ffffc90000babdc8
        CR2: 0000000000000000
      
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/69329cb29b8f324bb5fcea14d61d224807fb6488.1477405374.git.jpoimboe@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  9. 28 Sep, 2016 1 commit
  10. 19 Sep, 2016 1 commit
  11. 09 Sep, 2016 1 commit
  12. 08 Sep, 2016 1 commit
  13. 26 Jul, 2016 1 commit
  14. 15 Jul, 2016 3 commits
  15. 20 May, 2016 1 commit
  16. 03 Mar, 2016 1 commit
    • x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA · e2155543
      Dave Hansen authored
      Andrey Wagin reported that a simple test case was broken by:
      
      	62b5f7d013fc ("mm/core, x86/mm/pkeys: Add execute-only protection keys support")
      
      This test case creates an unreadable VMA and my patch assumed
      that all writes must be to readable VMAs.
      
      The simplest fix for this is to remove the pkey-related bits
      in access_error().  For execute-only support, I believe the
      existing version is sufficient because the permissions we
      are trying to enforce are entirely expressed in vma->vm_flags.
      We just depend on pkeys to get *an* exception, it does not
      matter that PF_PK was set, or even what state PKRU is in.
      
      I will re-add the necessary bits with the full pkeys
      implementation that includes the new syscalls.
      
      The three cases that matter are:
      
      1. If a write to an execute-only VMA occurs, we will see PF_WRITE
         set, but !VM_WRITE on the VMA, and return 1.  All execute-only
         VMAs have VM_WRITE clear by definition.
      2. If a read occurs on a present PTE, we will fall in to the "read,
         present" case and return 1.
      3. If a read occurs to a non-present PTE, we will miss the "read,
         not present" case, because the execute-only VMA will have
         VM_EXEC set, and we will properly return 0 allowing the PTE to
         be populated.
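The three cases can be walked through with a simplified userspace model of the check once the pkey-related bits are gone. This is a sketch, not the kernel function: `access_error_model()` is a hypothetical name, it takes the VMA flags directly instead of a `struct vm_area_struct`, and only the error-code and flag bits needed here are defined.

```c
/* x86 page fault error code bits (subset). */
#define PF_PROT		(1UL << 0)	/* fault on a present page */
#define PF_WRITE	(1UL << 1)	/* fault was caused by a write */

/* VMA permission flags (subset). */
#define VM_READ		0x1UL
#define VM_WRITE	0x2UL
#define VM_EXEC		0x4UL

/* Returns 1 if the access is an error, 0 if fault handling may proceed. */
static int access_error_model(unsigned long error_code, unsigned long vm_flags)
{
	if (error_code & PF_WRITE) {
		/* write, present and write, not present */
		return !(vm_flags & VM_WRITE);
	}

	/* read, present */
	if (error_code & PF_PROT)
		return 1;

	/* read, not present */
	if (!(vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))
		return 1;

	return 0;
}
```

In this model a write fault on a write-only VMA (`PF_WRITE` with only `VM_WRITE` set) is allowed, which is the behavior the test program relies on.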
      
      Test program:
      
       #include <sys/mman.h>
      
       int main(void)
       {
      	int *p;
      	p = mmap(NULL, 4096, PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      	p[0] = 1;
      
      	return 0;
       }
      
      Reported-by: Andrey Wagin <avagin@gmail.com>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Cc: linux-next@vger.kernel.org
      Fixes: 62b5f7d0 ("mm/core, x86/mm/pkeys: Add execute-only protection keys support")
      Link: http://lkml.kernel.org/r/20160301194133.65D0110C@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  17. 18 Feb, 2016 9 commits
    • mm/core, x86/mm/pkeys: Add execute-only protection keys support · 62b5f7d0
      Dave Hansen authored
      Protection keys provide new page-based protection in hardware.
      But, they have an interesting attribute: they only affect data
      accesses and never affect instruction fetches.  That means that
      if we set up some memory which is set as "access-disabled" via
      protection keys, we can still execute from it.
      
      This patch uses protection keys to set up mappings to do just that.
      If a user calls:
      
      	mmap(..., PROT_EXEC);
      or
      	mprotect(ptr, sz, PROT_EXEC);
      
      (note PROT_EXEC-only without PROT_READ/WRITE), the kernel will
      notice this, and set a special protection key on the memory.  It
      also sets the appropriate bits in the Protection Keys User Rights
      (PKRU) register so that the memory becomes unreadable and
      unwritable.
      
      I haven't found any userspace that does this today.  With this
      facility in place, we expect userspace to move to use it
      eventually.  Userspace _could_ start doing this today.  Any
      PROT_EXEC calls get converted to PROT_READ inside the kernel, and
      would transparently be upgraded to "true" PROT_EXEC with this
      code.  IOW, userspace never has to do any PROT_EXEC runtime
      detection.
      
      This feature provides enhanced protection against leaking
      executable memory contents.  This helps thwart attacks which are
      attempting to find ROP gadgets on the fly.
      
      But, the security provided by this approach is not comprehensive.
      The PKRU register which controls access permissions is a normal
      user register writable from unprivileged userspace.  An attacker
      who can execute the 'wrpkru' instruction can easily disable the
      protection provided by this feature.
      
      The protection key that is used for execute-only support is
      permanently dedicated at compile time.  This is fine for now
      because there is currently no API to set a protection key other
      than this one.
      
      Despite there being a constant PKRU value across the entire
      system, we do not set it unless this feature is in use in a
      process.  That is to preserve the PKRU XSAVE 'init state',
      which can lead to faster context switches.
      
      PKRU *is* a user register and the kernel is modifying it.  That
      means that code doing:
      
      	pkru = rdpkru();
      	pkru |= 0x100;
      	mmap(..., PROT_EXEC);
      	wrpkru(pkru);
      
      could lose the bits in PKRU that enforce execute-only
      permissions.  To avoid this, we suggest avoiding ever calling
      mmap() or mprotect() when the PKRU value is expected to be
      unstable.
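The lost update can be replayed with a plain variable standing in for the PKRU register. This is only a model under stated assumptions: `pkru_model`, `mmap_exec_only_model()` and the key number 15 are hypothetical, and the kernel side is reduced to setting the access-disable bit of the dedicated key. The bit layout itself (two bits per key, access-disable at bit 2*key) follows the hardware definition.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the per-thread PKRU register. */
static uint32_t pkru_model;

/* Hypothetical number of the key dedicated to execute-only mappings. */
#define EXEC_ONLY_PKEY_MODEL	15u

/* PKRU holds two bits per key: access-disable at bit 2*key. */
static uint32_t pkru_ad_mask(unsigned int pkey)
{
	return 1u << (pkey * 2);
}

/* Models the kernel side of mmap(..., PROT_EXEC): it edits PKRU. */
static void mmap_exec_only_model(void)
{
	pkru_model |= pkru_ad_mask(EXEC_ONLY_PKEY_MODEL);
}

static bool exec_only_enforced(void)
{
	return (pkru_model & pkru_ad_mask(EXEC_ONLY_PKEY_MODEL)) != 0;
}

/* The problematic ordering: mmap() runs between the read and the write. */
static void racy_sequence(void)
{
	uint32_t pkru = pkru_model;	/* pkru = rdpkru(); */

	pkru |= 0x100;			/* access-disable for key 4 */
	mmap_exec_only_model();		/* kernel updates PKRU */
	pkru_model = pkru;		/* wrpkru(pkru): stale value wins */
}

/* The suggested ordering: no mmap() while the value is in flight. */
static void safe_sequence(void)
{
	uint32_t pkru;

	mmap_exec_only_model();
	pkru = pkru_model;
	pkru |= 0x100;
	pkru_model = pkru;
}
```

After `racy_sequence()` the execute-only bit is gone from the model register, while `safe_sequence()` keeps both changes.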
      
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Chen Gang <gang.chen.5i5j@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: David Hildenbrand <dahi@linux.vnet.ibm.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Piotr Kwapulinski <kwapulinski.piotr@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: Vladimir Murzin <vladimir.murzin@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: keescook@google.com
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20160212210240.CB4BB5CA@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • mm/core, x86/mm/pkeys: Differentiate instruction fetches · d61172b4
      Dave Hansen authored
      As discussed earlier, we attempt to enforce protection keys in
      software.
      
      However, the code checks all faults to ensure that they are not
      violating protection key permissions.  It was assumed that all
      faults are either write faults where we check PKRU[key].WD (write
      disable) or read faults where we check the AD (access disable)
      bit.
      
      But, there is a third category of faults for protection keys:
      instruction faults.  Instruction faults never run afoul of
      protection keys because protection keys do not affect
      instruction fetches.
      
      So, plumb the PF_INSTR bit down in to the
      arch_vma_access_permitted() function where we do the protection
      key checks.
      
      We also add a new FAULT_FLAG_INSTRUCTION.  This is because
      handle_mm_fault() is not passed the architecture-specific
      error_code where we keep PF_INSTR, so we need to encode the
      instruction fetch information in to the arch-generic fault
      flags.
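A minimal model of the plumbing, assuming simplified signatures: the architecture-specific `PF_INSTR` bit is translated into a generic fault flag, and the permission check returns early for instruction fetches. The function names, the flat `vma_pkey` argument and the numeric value chosen for `FAULT_FLAG_INSTRUCTION` are illustrative, not copied from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define PF_INSTR		(1UL << 4)	/* x86 error code: instruction fetch */
#define FAULT_FLAG_INSTRUCTION	0x100u		/* illustrative value for the generic flag */

/* Encode the arch-specific error code bit in arch-generic fault flags. */
static unsigned int fault_flags_from_error_code(unsigned long error_code)
{
	unsigned int flags = 0;

	if (error_code & PF_INSTR)
		flags |= FAULT_FLAG_INSTRUCTION;
	return flags;
}

/* PKRU holds two bits per key: AD at bit 2*key, WD at bit 2*key + 1. */
static bool pkru_allows_pkey(uint32_t pkru, unsigned int pkey, bool write)
{
	unsigned int shift = pkey * 2;

	if (pkru & (1u << shift))		/* access disabled */
		return false;
	if (write && (pkru & (2u << shift)))	/* write disabled */
		return false;
	return true;
}

/* Hypothetical stand-in for the VMA-level check. */
static bool vma_access_permitted_model(unsigned int vma_pkey, uint32_t pkru,
				       bool write, unsigned int fault_flags)
{
	/* Protection keys never affect instruction fetches. */
	if (fault_flags & FAULT_FLAG_INSTRUCTION)
		return true;

	return pkru_allows_pkey(pkru, vma_pkey, write);
}
```

With access disabled for a key, a data read is refused by the model but an instruction fetch from the same VMA passes.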
      
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20160212210224.96928009@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/mm/pkeys: Optimize fault handling in access_error() · 07f146f5
      Dave Hansen authored
      We might not strictly have to make modifications to
      access_error() to check the VMA here.
      
      If we do not, we will do this:
      
       1. app sets VMA pkey to K
       2. app touches a !present page
       3. do_page_fault(), allocates and maps page, sets pte.pkey=K
       4. return to userspace
       5. touch instruction reexecutes, but triggers PF_PK
       6. do PKEY signal
      
      What happens with this patch applied:
      
       1. app sets VMA pkey to K
       2. app touches a !present page
       3. do_page_fault() notices that K is inaccessible
       4. do PKEY signal
      
      We basically skip the fault that does an allocation.
      
      So what this lets us do is protect areas from even being
      *populated* unless they are accessible according to protection
      keys.  That seems handy to me and makes protection keys work
      more like an mprotect()'d mapping.
      
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20160212210222.EBB63D8C@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys · 33a709b2
      Dave Hansen authored
      Today, for normal faults and page table walks, we check the VMA
      and/or PTE to ensure that it is compatible with the action.  For
      instance, if we get a write fault on a non-writeable VMA, we
      SIGSEGV.
      
      We try to do the same thing for protection keys.  Basically, we
      try to make sure that if a user does this:
      
      	mprotect(ptr, size, PROT_NONE);
      	*ptr = foo;
      
      they see the same effects with protection keys when they do this:
      
      	mprotect(ptr, size, PROT_READ|PROT_WRITE);
      	set_pkey(ptr, size, 4);
      	wrpkru(0xffffff3f); // access disable pkey 4
      	*ptr = foo;
      
      The state to do that checking is in the VMA, but we also
      sometimes have to do it on the page tables only, like when doing
      a get_user_pages_fast() where we have no VMA.
      
      We add two functions and expose them to generic code:
      
      	arch_pte_access_permitted(pte_flags, write)
      	arch_vma_access_permitted(vma, write)
      
      These are, of course, backed up in x86 arch code with checks
      against the PTE or VMA's protection key.
      
      But, there are also cases where we do not want to respect
      protection keys.  When we ptrace(), for instance, we do not want
      to apply the tracer's PKRU permissions to the PTEs from the
      process being traced.
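For the page-table-only path, the key has to come from the PTE itself. The sketch below assumes the x86 layout in which the protection key occupies PTE bits 59 to 62. The `foreign` parameter is a hypothetical way to express the ptrace() case where the caller's PKRU should not be applied; the functions named in this commit take different arguments.

```c
#include <stdbool.h>
#include <stdint.h>

/* x86: the protection key lives in PTE bits 59..62. */
#define PTE_PKEY_SHIFT	59
#define PTE_PKEY_MASK	(0xfULL << PTE_PKEY_SHIFT)

static unsigned int pte_pkey(uint64_t pte)
{
	return (unsigned int)((pte & PTE_PKEY_MASK) >> PTE_PKEY_SHIFT);
}

/* PKRU holds two bits per key: AD at bit 2*key, WD at bit 2*key + 1. */
static bool pkru_allows_pkey(uint32_t pkru, unsigned int pkey, bool write)
{
	unsigned int shift = pkey * 2;

	if (pkru & (1u << shift))		/* access disabled */
		return false;
	if (write && (pkru & (2u << shift)))	/* write disabled */
		return false;
	return true;
}

/* Hypothetical model of a PTE-level check with a "foreign" escape hatch. */
static bool pte_access_permitted_model(uint64_t pte, uint32_t pkru,
				       bool write, bool foreign)
{
	/* Do not apply the caller's PKRU to another process's memory. */
	if (foreign)
		return true;

	return pkru_allows_pkey(pkru, pte_pkey(pte), write);
}
```

The same PTE is refused for the owning task when its key is access-disabled, yet allowed when the access is marked as foreign.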
      
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: David Hildenbrand <dahi@linux.vnet.ibm.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
      Cc: Dominik Vogt <vogt@linux.vnet.ibm.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Shachar Raindel <raindel@mellanox.com>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: linux-arch@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: linux-s390@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Link: http://lkml.kernel.org/r/20160212210219.14D5D715@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/mm/pkeys: Fill in pkey field in siginfo · 019132ff
      Dave Hansen authored
      This fills in the new siginfo field: si_pkey to indicate to
      userspace which protection key was set on the PTE that we faulted
      on.
      
      Note though that *ALL* protection key faults have to be generated
      by a valid, present PTE at some point.  But this code does no PTE
      lookups, which seems odd.  The reason is that we take advantage of
      the way we generate PTEs from VMAs.  All PTEs under a VMA share
      some attributes.  For instance, they are _all_ either PROT_READ
      *OR* PROT_NONE.  They also always share a protection key, so we
      never have to walk the page tables; we just use the VMA.
      
      Note that _pkey is a 64-bit value.  The current hardware only
      supports 4-bit protection keys.  We do this because there is
      _plenty_ of space in _sigfault and it is possible that future
      processors would support more than 4 bits of protection keys.
      
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20160212210213.ABC488FA@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/mm/pkeys: Pass VMA down in to fault signal generation code · 7b2d0dba
      Dave Hansen authored
      During a page fault, we look up the VMA to ensure that the fault
      is in a region with a valid mapping.  But, in the top-level page
      fault code we don't need the VMA for much else.  Once we have
      decided that an access is bad, we are going to send a signal no
      matter what and do not need the VMA any more.  So we do not pass
      it down in to the signal generation code.
      
      But, for protection keys, we need the VMA.  It tells us *which*
      protection key we violated if we get a PF_PK.  So, we need to
      pass the VMA down and fill in siginfo->si_pkey.
      
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20160212210211.AD3B36A3@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/mm/pkeys: Add new 'PF_PK' page fault error code bit · b3ecd515
      Dave Hansen authored
      Note: "PK" is how the Intel SDM refers to this bit, so we also
      use that nomenclature.
      
      This only defines the bit, it does not plumb it anywhere to be
      handled.
      
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20160212210207.DA7B43E6@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/mm: Expand the exception table logic to allow new handling options · 548acf19
      Tony Luck authored
      Huge amounts of help from Andy Lutomirski and Borislav Petkov to
      produce this. Andy provided the inspiration to add classes to the
      exception table with a clever bit-squeezing trick, Boris pointed
      out how much cleaner it would all be if we just had a new field.
      
      Linus Torvalds blessed the expansion with:
      
        ' I'd rather not be clever in order to save just a tiny amount of space
          in the exception table, which isn't really critical for anybody. '
      
      The third field is another relative function pointer, this one to a
      handler that executes the actions.
      
      We start out with three handlers:
      
       1: Legacy - just jumps to the fixup IP
       2: Fault - provide the trap number in %ax to the fixup code
       3: Cleaned up legacy for the uaccess error hack
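The idea of a per-entry handler can be sketched in userspace as follows. This is a simplified model: it stores plain addresses and a function pointer where the commit describes relative pointers, it uses a linear search, and it covers only the first two handlers. All type and function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal register state used by the model. */
struct regs_model {
	uintptr_t ip;		/* instruction pointer */
	unsigned long ax;	/* accumulator register */
};

struct extable_entry_model;

typedef bool (*ex_handler_t)(const struct extable_entry_model *entry,
			     struct regs_model *regs, int trapnr);

/* Three fields per entry: faulting insn, fixup target, handler. */
struct extable_entry_model {
	uintptr_t insn;
	uintptr_t fixup;
	ex_handler_t handler;
};

/* Legacy behavior: just jump to the fixup address. */
static bool ex_handler_default_model(const struct extable_entry_model *entry,
				     struct regs_model *regs, int trapnr)
{
	(void)trapnr;
	regs->ip = entry->fixup;
	return true;
}

/* Fault behavior: also hand the trap number to the fixup code in ax. */
static bool ex_handler_fault_model(const struct extable_entry_model *entry,
				   struct regs_model *regs, int trapnr)
{
	regs->ip = entry->fixup;
	regs->ax = (unsigned long)trapnr;
	return true;
}

/* Find the entry for the faulting instruction and run its handler. */
static bool fixup_exception_model(const struct extable_entry_model *table,
				  size_t count, struct regs_model *regs,
				  int trapnr)
{
	size_t i;

	for (i = 0; i < count; i++) {
		if (table[i].insn == regs->ip)
			return table[i].handler(&table[i], regs, trapnr);
	}
	return false;	/* no entry: the fault is not recoverable here */
}
```

Adding a new kind of recovery then means adding a handler function, with no change to the lookup code.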
      
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Reviewed-by: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/f6af78fcbd348cf4939875cfda9c19689b5e50b8.1455732970.git.tony.luck@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/mm: Fix vmalloc_fault() to handle large pages properly · f4eafd8b
      Toshi Kani authored
      A kernel page fault oops with the callstack below was observed
      when a read syscall was made to a pmem device after a huge amount
      (>512GB) of vmalloc ranges was allocated by ioremap() on an x86_64
      system:
      
           BUG: unable to handle kernel paging request at ffff880840000ff8
           IP: vmalloc_fault+0x1be/0x300
           PGD c7f03a067 PUD 0
           Oops: 0000 [#1] SM
           Call Trace:
              __do_page_fault+0x285/0x3e0
              do_page_fault+0x2f/0x80
              ? put_prev_entity+0x35/0x7a0
              page_fault+0x28/0x30
              ? memcpy_erms+0x6/0x10
              ? schedule+0x35/0x80
              ? pmem_rw_bytes+0x6a/0x190 [nd_pmem]
              ? schedule_timeout+0x183/0x240
              btt_log_read+0x63/0x140 [nd_btt]
               :
              ? __symbol_put+0x60/0x60
              ? kernel_read+0x50/0x80
              SyS_finit_module+0xb9/0xf0
              entry_SYSCALL_64_fastpath+0x1a/0xa4
      
      Since v4.1, ioremap() supports large page (pud/pmd) mappings in
      x86_64 and PAE.  vmalloc_fault() however assumes that the vmalloc
      range is limited to pte mappings.
      
      vmalloc faults do not normally happen in ioremap'd ranges since
      ioremap() sets up the kernel page tables, which are shared by
      user processes.  pgd_ctor() copies the kernel's PGD entries into
      the user process's PGD during fork().  When allocation of the
      vmalloc ranges crosses a 512GB boundary, ioremap() allocates a
      new pud table and updates the kernel PGD entry to point to it.
      If the user process's PGD entry does not have this update yet, a
      read/write syscall to the range will cause a vmalloc fault, which
      hits the Oops above as it does not handle a large page properly.
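The 512GB figure follows from the paging layout: with 4-level x86_64 paging each PGD entry maps 2^39 bytes. A short sketch of that arithmetic, with hypothetical helper names and assuming a non-zero length:

```c
#include <stdbool.h>
#include <stdint.h>

/* 4-level x86_64 paging: one PGD entry maps 2^39 bytes = 512GB. */
#define PGDIR_SHIFT	39
#define PTRS_PER_PGD	512

/* Index of the PGD entry that covers a virtual address. */
static unsigned int pgd_index_model(uint64_t addr)
{
	return (unsigned int)((addr >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1));
}

/* True if [start, start + len) needs more than one PGD entry (len > 0). */
static bool crosses_pgd_boundary(uint64_t start, uint64_t len)
{
	return pgd_index_model(start) != pgd_index_model(start + len - 1);
}
```

A range that stays within one PGD entry reuses a PUD table every process already points to; only a range that spills into the next entry needs a new kernel PGD entry that older user PGDs may lack.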
      
      The following changes are made to vmalloc_fault():
      
      64-bit:
      
       - No change for the PGD sync operation as it handles large
         pages already.
       - Add pud_huge() and pmd_huge() to the validation code to
         handle large pages.
       - Change pud_page_vaddr() to pud_pfn() since an ioremap range
         is not directly mapped (while the if-statement still works
         with a bogus addr).
       - Change pmd_page() to pmd_pfn() since an ioremap range is not
         backed by struct page (while the if-statement still works
         with a bogus addr).
      
      32-bit:
       - No change for the sync operation since the index3 PGD entry
         covers the entire vmalloc range, which is always valid.
         (A separate change to sync PGD entry is necessary if this
          memory layout is changed regardless of the page size.)
       - Add pmd_huge() to the validation code to handle large pages.
         This is for completeness since vmalloc_fault() won't happen
         in ioremap'd ranges as its PGD entry is always valid.
      Reported-by: Henning Schild <henning.schild@siemens.com>
      Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
      Acked-by: Borislav Petkov <bp@alien8.de>
      Cc: <stable@vger.kernel.org> # 4.1+
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luis R. Rodriguez <mcgrof@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: linux-mm@kvack.org
      Cc: linux-nvdimm@lists.01.org
      Link: http://lkml.kernel.org/r/1455758214-24623-1-git-send-email-toshi.kani@hpe.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f4eafd8b
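The large-page handling described in this commit can be pictured with a small user-space model. This is only a sketch, not the kernel's vmalloc_fault(): the `model_*` names and the `MODEL_PRESENT`/`MODEL_HUGE` flags are assumptions made for illustration. It shows the one idea the validation code needs: a present entry marked as a large page is already the leaf, so the walk must stop there instead of descending into a lower-level table that does not exist.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical entry flags for this model. The bit positions follow x86's
 * present and page-size bits, but nothing here is kernel code.
 */
#define MODEL_PRESENT (UINT32_C(1) << 0)
#define MODEL_HUGE    (UINT32_C(1) << 7)

enum model_level { MODEL_PUD, MODEL_PMD, MODEL_PTE, MODEL_LEVELS };

/* One entry per level along the path that translates the faulting address. */
struct model_walk {
	uint32_t entry[MODEL_LEVELS];
};

/*
 * Validate the path. Returns true when a present leaf is found and stores
 * the level at which translation ends in *leaf. A present entry with
 * MODEL_HUGE set is a leaf, so the walk must not descend below it.
 */
static bool model_validate(const struct model_walk *w, enum model_level *leaf)
{
	assert(w != NULL && leaf != NULL);

	for (int level = MODEL_PUD; level < MODEL_LEVELS; level++) {
		uint32_t e = w->entry[level];

		if (!(e & MODEL_PRESENT))
			return false;		/* genuinely unmapped */
		if (level == MODEL_PTE || (e & MODEL_HUGE)) {
			*leaf = (enum model_level)level;
			return true;
		}
	}
	return false;				/* not reached */
}
```

With a path whose PMD entry is present and huge, `model_validate()` reports `MODEL_PMD` as the leaf even though the PTE slot is empty. A walk without the huge check would read that empty slot and wrongly treat the address as unmapped.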
  18. 31 Jul, 2015 2 commits
  19. 19 May, 2015 1 commit
    • David Hildenbrand's avatar
      mm/fault, arch: Use pagefault_disable() to check for disabled pagefaults in the handler · 70ffdb93
      David Hildenbrand authored
      Introduce faulthandler_disabled() and use it to check for irq context and
      disabled pagefaults (via pagefault_disable()) in the pagefault handlers.
      
      Please note that we keep the in_atomic() checks in place - to detect
      whether in irq context (in which case preemption is always properly
      disabled).
      
      In contrast, preempt_disable() should never be used to disable pagefaults.
      With !CONFIG_PREEMPT_COUNT, preempt_disable() doesn't modify the preempt
      counter, and therefore the result of in_atomic() differs.
      We validate that condition by using might_fault() checks when calling
      might_sleep().
      
      Therefore, add a comment to faulthandler_disabled(), describing why this
      is needed.
      
      faulthandler_disabled() and pagefault_disable() are defined in
      linux/uaccess.h, so let's properly add that include to all relevant files.
      
      This patch is based on a patch from Thomas Gleixner.
      Reviewed-and-tested-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: David.Laight@ACULAB.COM
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: airlied@linux.ie
      Cc: akpm@linux-foundation.org
      Cc: benh@kernel.crashing.org
      Cc: bigeasy@linutronix.de
      Cc: borntraeger@de.ibm.com
      Cc: daniel.vetter@intel.com
      Cc: heiko.carstens@de.ibm.com
      Cc: herbert@gondor.apana.org.au
      Cc: hocko@suse.cz
      Cc: hughd@google.com
      Cc: mst@redhat.com
      Cc: paulus@samba.org
      Cc: ralf@linux-mips.org
      Cc: schwidefsky@de.ibm.com
      Cc: yang.shi@windriver.com
      Link: http://lkml.kernel.org/r/1431359540-32227-7-git-send-email-dahi@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
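A user-space model may help show why a dedicated pagefault-disable counter is needed next to in_atomic(). This is a sketch under assumptions: `struct model_task` and every `model_*` function are hypothetical stand-ins, and the `preempt_count_cfg` parameter plays the role of CONFIG_PREEMPT_COUNT. Only the shape of the check (pagefaults disabled, or in atomic context) is taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-task state involved. */
struct model_task {
	int preempt_count;	/* what in_atomic() looks at */
	int pagefault_disabled;	/* dedicated pagefault_disable() counter */
};

/* With !CONFIG_PREEMPT_COUNT the preempt counter is not touched at all. */
static void model_preempt_disable(struct model_task *t, bool preempt_count_cfg)
{
	if (preempt_count_cfg)
		t->preempt_count++;
}

static void model_pagefault_disable(struct model_task *t)
{
	t->pagefault_disabled++;
}

static void model_pagefault_enable(struct model_task *t)
{
	assert(t->pagefault_disabled > 0);	/* enable must pair with a disable */
	t->pagefault_disabled--;
}

static bool model_in_atomic(const struct model_task *t)
{
	return t->preempt_count != 0;
}

/* The handler backs off if pagefaults were disabled or we are atomic. */
static bool model_faulthandler_disabled(const struct model_task *t)
{
	return t->pagefault_disabled != 0 || model_in_atomic(t);
}
```

Calling `model_preempt_disable(&t, false)` leaves `model_faulthandler_disabled(&t)` false, which is the !CONFIG_PREEMPT_COUNT case the message warns about. `model_pagefault_disable(&t)` flips it to true regardless of the configuration.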
  20. 23 Mar, 2015 2 commits
  21. 04 Feb, 2015 1 commit
  22. 29 Jan, 2015 1 commit
    • Linus Torvalds's avatar
      vm: add VM_FAULT_SIGSEGV handling support · 33692f27
      Linus Torvalds authored
      The core VM already knows about VM_FAULT_SIGBUS, but cannot return a
      "you should SIGSEGV" error, because the SIGSEGV case was generally
      handled by the caller - usually the architecture fault handler.
      
      That results in lots of duplication - all the architecture fault
      handlers end up doing very similar "look up vma, check permissions, do
      retries etc" - but it generally works.  However, there are cases where
      the VM actually wants to SIGSEGV, and applications _expect_ SIGSEGV.
      
      In particular, when accessing the stack guard page, libsigsegv expects a
      SIGSEGV.  And it usually got one, because the stack growth is handled by
      that duplicated architecture fault handler.
      
      However, when the generic VM layer started propagating the error return
      from the stack expansion in commit fee7e49d ("mm: propagate error
      from stack expansion even for guard page"), that now exposed the
      existing VM_FAULT_SIGBUS result to user space.  And user space really
      expected SIGSEGV, not SIGBUS.
      
      To fix that case, we need to add a VM_FAULT_SIGSEGV, and teach all those
      duplicate architecture fault handlers about it.  They all already have
      the code to handle SIGSEGV, so it's about just tying that new return
      value to the existing code, but it's all a bit annoying.
      
      This is the mindless minimal patch to do this.  A more extensive patch
      would be to try to gather up the mostly shared fault handling logic into
      one generic helper routine, and long-term we really should do that
      cleanup.
      
      Just from this patch, you can generally see that most architectures just
      copied (directly or indirectly) the old x86 way of doing things, but in
      the meantime that original x86 model has been improved to hold the VM
      semaphore for shorter times etc and to handle VM_FAULT_RETRY and other
      "newer" things, so it would be a good idea to bring all those
      improvements to the generic case and teach other architectures about
      them too.
      Reported-and-tested-by: Takashi Iwai <tiwai@suse.de>
      Tested-by: Jan Engelhardt <jengelh@inai.de>
      Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # "s390 still compiles and boots"
      Cc: linux-arch@vger.kernel.org
      Cc: stable@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
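The "tie the new return value to the existing code" step can be sketched as a small dispatch function. This is a hedged user-space model, not any architecture's real handler: the `MODEL_*` names and bit values are illustrative assumptions, and the real flags live in the kernel headers.

```c
#include <assert.h>

/* Illustrative bit values for this sketch only. */
#define MODEL_FAULT_OOM     0x01u
#define MODEL_FAULT_SIGBUS  0x02u
#define MODEL_FAULT_SIGSEGV 0x40u
#define MODEL_FAULT_ERROR \
	(MODEL_FAULT_OOM | MODEL_FAULT_SIGBUS | MODEL_FAULT_SIGSEGV)

/* What the architecture handler does next. */
enum model_action {
	MODEL_ACT_OOM,
	MODEL_ACT_SIGBUS,
	MODEL_ACT_SIGSEGV,
};

/*
 * Map an error result from the generic VM layer to the handler's existing
 * paths. Callers only get here when an error bit is set.
 */
static enum model_action model_fault_error(unsigned int fault)
{
	assert(fault & MODEL_FAULT_ERROR);

	if (fault & MODEL_FAULT_OOM)
		return MODEL_ACT_OOM;
	if (fault & MODEL_FAULT_SIGBUS)
		return MODEL_ACT_SIGBUS;
	return MODEL_ACT_SIGSEGV;	/* the newly added case */
}
```

The SIGSEGV path already existed in each handler for bad addresses. The only new piece in this model is the final branch that routes the VM layer's result to it.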
  23. 17 Dec, 2014 1 commit
  24. 15 Dec, 2014 2 commits
    • Linus Torvalds's avatar
      x86: mm: consolidate VM_FAULT_RETRY handling · 26178ec1
      Linus Torvalds authored
      The VM_FAULT_RETRY handling was confusing and incorrect for the case of
      returning to kernel mode.  We need to handle the exception table fixup
      if we return to kernel mode due to a fatal signal - it will basically
      look to the kernel user mode access like the access failed due to the VM
      going away from under it.  Which is correct - the process is dying - and
      avoids the whole "repeat endless kernel page faults" case.
      
      Handling the VM_FAULT_RETRY early and in just one place also simplifies
      the mmap_sem handling, since once we've taken care of VM_FAULT_RETRY we
      know that we can just drop the lock.  The remaining accounting and
      possible error handling is thread-local and does not need the mmap_sem.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
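The consolidated control flow can be sketched as a single decision function. This is a user-space model under assumptions: all `MODEL_*` names and flag values are hypothetical, and the real handler works on the fault flags and registers rather than on a return code. It shows the three outcomes of a retry result handled in one place (retry once, return to user mode, or take the exception table fixup) before the common "drop the lock and finish" path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag values for this sketch only. */
#define MODEL_FAULT_RETRY      0x1u
#define MODEL_FLAG_ALLOW_RETRY 0x1u
#define MODEL_FLAG_TRIED       0x2u
#define MODEL_FLAG_USER        0x4u

enum model_next {
	MODEL_RETRY,			/* go around once more */
	MODEL_RETURN_TO_USER,		/* fatal signal is handled on return */
	MODEL_KERNEL_FIXUP,		/* exception table fixup for kernel mode */
	MODEL_UNLOCK_AND_FINISH,	/* drop the lock, then thread-local work */
};

static enum model_next model_after_fault(unsigned int fault,
					 unsigned int *flags,
					 bool fatal_signal)
{
	assert(flags != NULL);

	if (fault & MODEL_FAULT_RETRY) {
		/* Retry at most once. */
		if (*flags & MODEL_FLAG_ALLOW_RETRY) {
			*flags &= ~MODEL_FLAG_ALLOW_RETRY;
			*flags |= MODEL_FLAG_TRIED;
			if (!fatal_signal)
				return MODEL_RETRY;
		}
		if (*flags & MODEL_FLAG_USER)
			return MODEL_RETURN_TO_USER;
		return MODEL_KERNEL_FIXUP;
	}
	return MODEL_UNLOCK_AND_FINISH;
}
```

A kernel-mode access with a fatal signal pending yields `MODEL_KERNEL_FIXUP` in this model, which corresponds to the case the message says was previously handled incorrectly.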
    • Linus Torvalds's avatar
      x86: mm: move mmap_sem unlock from mm_fault_error() to caller · 7fb08eca
      Linus Torvalds authored
      This replaces four copies in various stages of mm_fault_error() handling
      with just a single one.  It will also allow for more natural placement
      of the unlocking after some further cleanup.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  25. 23 Sep, 2014 1 commit
    • David Vrabel's avatar
      x86: skip check for spurious faults for non-present faults · 31668511
      David Vrabel authored
      If a fault on a kernel address is due to a non-present page, then it
      cannot be the result of a stale TLB entry from a protection change (RO
      to RW or NX to X).  Thus the pagetable walk in spurious_fault() can be
      skipped.
      
      See the initial if in spurious_fault() and the tests in
      spurious_fault_check() for the set of possible error codes checked
      for spurious faults.  These are:
      
               IRUWP
      Before   x00xx && ( 1xxxx || xxx1x )
      After  ( 10001 || 00011 ) && ( 1xxxx || xxx1x )
      
      Thus the new condition is a subset of the previous one, excluding only
      non-present faults (I == 1 and W == 1 are mutually exclusive).
      
      This avoids spurious_fault() oopsing in some cases if the pagetables
      it attempts to walk are not accessible.  Such an oops obscures the
      location of the original fault.
      
      This also fixes a crash with Xen PV guests when they access entries in
      the M2P corresponding to device MMIO regions.  The M2P is mapped
      (read-only) by Xen into the kernel address space of the guest and this
      mapping may contain holes for non-RAM regions.  Read faults will
      result in calls to spurious_fault(), but because the page tables for
      the M2P mappings are not accessible by the guest the pagetable walk
      would fault.
      
      This was not normally a problem as MMIO mappings would not normally
      result in a M2P lookup because of the use of the _PAGE_IOMAP bit in
      the PTE.  However, removing the _PAGE_IOMAP bit requires M2P lookups
      for MMIO mappings as well.
      Signed-off-by: David Vrabel <david.vrabel@citrix.com>
      Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
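The subset claim in the IRUWP table can be checked by brute force over all 32 error codes. The sketch below transcribes the "Before" and "After" conditions exactly as written in the message. The `MODEL_PF_*` names are local to this example, and the bit order (P in bit 0 up to I in bit 4) is read off the "IRUWP" header.

```c
#include <assert.h>
#include <stdbool.h>

/* Page fault error code bits, in the commit's "IRUWP" order. */
#define MODEL_PF_PROT  (1u << 0)	/* P: fault on a present page */
#define MODEL_PF_WRITE (1u << 1)	/* W */
#define MODEL_PF_USER  (1u << 2)	/* U */
#define MODEL_PF_RSVD  (1u << 3)	/* R */
#define MODEL_PF_INSTR (1u << 4)	/* I */

/* "Before": x00xx && ( 1xxxx || xxx1x ) */
static bool model_checked_before(unsigned int code)
{
	return !(code & (MODEL_PF_RSVD | MODEL_PF_USER)) &&
	       (code & (MODEL_PF_INSTR | MODEL_PF_WRITE));
}

/* "After": ( 10001 || 00011 ) && ( 1xxxx || xxx1x ) */
static bool model_checked_after(unsigned int code)
{
	return (code == (MODEL_PF_INSTR | MODEL_PF_PROT) ||
		code == (MODEL_PF_WRITE | MODEL_PF_PROT)) &&
	       (code & (MODEL_PF_INSTR | MODEL_PF_WRITE));
}

/*
 * Walk all 32 error codes. Returns how many codes the old condition
 * accepted that the new one rejects, asserting along the way that the new
 * set is a subset and that every dropped code is either non-present or
 * has both I and W set (which the commit notes cannot happen).
 */
static int model_count_dropped(void)
{
	int dropped = 0;

	for (unsigned int code = 0; code < 32; code++) {
		bool before = model_checked_before(code);
		bool after = model_checked_after(code);

		assert(!after || before);	/* subset */
		if (before && !after) {
			bool both_iw = (code & MODEL_PF_INSTR) &&
				       (code & MODEL_PF_WRITE);

			assert(!(code & MODEL_PF_PROT) || both_iw);
			dropped++;
		}
	}
	return dropped;
}
```

In this model the old condition accepts six codes and the new one accepts two, so `model_count_dropped()` returns 4. Three of the dropped codes have P clear, and the fourth sets both I and W.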