    mmap: make mlock_future_check() global · 6aeb2542
    Mike Rapoport authored
    Patch series "mm: introduce memfd_secret system call to create "secret" memory areas", v20.
    
    This is an implementation of "secret" mappings backed by a file
    descriptor.
    
    The file descriptor backing secret memory mappings is created using a
    dedicated memfd_secret system call.  The desired protection mode for the
    memory is configured using the flags parameter of the system call.  The mmap()
    of the file descriptor created with memfd_secret() will create a "secret"
    memory mapping.  The pages in that mapping will be marked as not present
    in the direct map and will be present only in the page table of the owning
    mm.
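
    A minimal userspace sketch of that flow, not part of this patch.  The
    secret_alloc() helper is hypothetical, and the fallback syscall number
    447 is an assumption that matches x86-64 and the generic syscall table
    in kernels that carry memfd_secret.  The call is expected to fail with
    ENOSYS on kernels that lack the system call or have it disabled.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumed fallback for older headers; prefer the value from <sys/syscall.h>. */
#ifndef SYS_memfd_secret
#define SYS_memfd_secret 447
#endif

/*
 * Hypothetical helper: map len bytes of secret memory.
 * Returns the mapping, or NULL with errno set on failure.
 */
void *secret_alloc(size_t len)
{
	/* flags == 0 requests the default protection mode. */
	int fd = (int)syscall(SYS_memfd_secret, 0);
	if (fd < 0)
		return NULL;

	void *p = MAP_FAILED;
	/* Size the anonymous file, then map it; secretmem requires MAP_SHARED. */
	if (ftruncate(fd, (off_t)len) == 0)
		p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	/* The mapping keeps the pages alive, so the descriptor can be closed. */
	int saved = errno;
	close(fd);
	errno = saved;

	return p == MAP_FAILED ? NULL : p;
}
```

    A caller would copy key material into the returned region and munmap()
    it when done.  The pages count against RLIMIT_MEMLOCK, so a large
    request may fail even where the system call is available.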
    
    Although normally Linux userspace mappings are protected from other users,
    such secret mappings are useful for environments where a hostile tenant is
    trying to trick the kernel into giving them access to other tenants'
    mappings.
    
    It's designed to provide the following protections:
    
    * Enhanced protection (in conjunction with all the other in-kernel
      attack prevention systems) against ROP attacks.  Secretmem makes
      "simple" ROP insufficient to perform exfiltration, which increases the
      required complexity of the attack.  Along with other protections like
      the kernel stack size limit and address space layout randomization, which
      make finding gadgets really hard, the absence of any in-kernel primitive
      for accessing secret memory means the one-gadget ROP attack can't work.
      Since the only way to access secret memory is to reconstruct the missing
      mapping entry, the attacker has to recover the physical page and insert
      a PTE pointing to it in the kernel and then retrieve the contents.  That
      takes at least three gadgets which is a level of difficulty beyond most
      standard attacks.
    
    * Prevent cross-process secret userspace memory exposures.  Once the
      secret memory is allocated, the user can't accidentally pass it into the
      kernel to be transmitted somewhere.  The secretmem pages cannot be
      accessed via the direct map and they are disallowed in GUP.
    
    * Harden against exploited kernel flaws.  In order to access secretmem,
      a kernel-side attack would need to either walk the page tables and
      create new ones, or spawn a new privileged userspace process to perform
      secrets exfiltration using ptrace.
    
    In the future the secret mappings may be used as a means to protect guest
    memory in a virtual machine host.
    
    For demonstration of secret memory usage we've created a userspace library
    
    https://git.kernel.org/pub/scm/linux/kernel/git/jejb/secret-memory-preloader.git
    
    that does two things: first, it acts as a preloader for openssl to
    redirect all the OPENSSL_malloc calls to secret memory, meaning any secret
    keys get automatically protected this way; second, it exposes the API to
    the user who needs it.  We anticipate that a lot of the
    use cases would be like the openssl one: many toolkits that deal with
    secret keys already have special handling for the memory to try to give
    them greater protection, so this would simply be pluggable into the
    toolkits without any need for user application modification.
    
    Hiding secret memory mappings behind an anonymous file allows usage of the
    page cache for tracking pages allocated for the "secret" mappings as well
    as using address_space_operations for e.g.  page migration callbacks.
    
    The anonymous file may also be used implicitly, like hugetlb files, to
    implement mmap(MAP_SECRET) and use the secret memory areas with "native"
    mm ABIs in the future.
    
    Removing pages from the direct map may fragment it on architectures
    that use large pages to map physical memory, which affects system
    performance.  However, the original Kconfig text for
    CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "...  can
    improve the kernel's performance a tiny bit ..." (commit 00d1c5e0
    ("x86: add gbpages switches")) and the recent report [1] showed that "...
    although 1G mappings are a good default choice, there is no compelling
    evidence that it must be the only choice".  Hence, it is sufficient to
    have secretmem disabled by default while letting a system administrator
    enable it at boot time.
    
    In addition, there is also a long term goal to improve management of the
    direct map.
    
    [1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/
    
    This patch (of 7):
    
    Make mlock_future_check() global so that the upcoming secret memory
    implementation can use it.
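
    For readers unfamiliar with the helper, the following is a userspace
    model of the check it performs, written from a reading of mm/mmap.c
    around the time of this series rather than copied from it.  All names
    here are hypothetical stand-ins: locked_pages models mm->locked_vm,
    limit_bytes models rlimit(RLIMIT_MEMLOCK), has_ipc_lock models
    capable(CAP_IPC_LOCK), and a 4 KiB page size is assumed.

```c
#include <errno.h>
#include <stdbool.h>

#define MODEL_PAGE_SHIFT 12            /* assumption: 4 KiB pages */
#define MODEL_VM_LOCKED  0x00002000UL  /* stand-in for the kernel's VM_LOCKED */

/*
 * Model of the accounting check: would locking len more bytes push the
 * address space past its locked-memory limit?
 * Returns 0 if the mapping may proceed, -EAGAIN otherwise.
 */
int model_mlock_future_check(unsigned long locked_pages, unsigned long flags,
			     unsigned long len, unsigned long limit_bytes,
			     bool has_ipc_lock)
{
	/* Mappings that are not locked are not accounted. */
	if (!(flags & MODEL_VM_LOCKED))
		return 0;

	unsigned long locked = (len >> MODEL_PAGE_SHIFT) + locked_pages;
	unsigned long limit = limit_bytes >> MODEL_PAGE_SHIFT;

	/* A privileged caller may exceed the limit. */
	if (locked > limit && !has_ipc_lock)
		return -EAGAIN;

	return 0;
}
```

    Secret memory stays resident, so it makes sense for its mmap() path to
    apply the same limit as other locked mappings.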
    
    Link: https://lkml.kernel.org/r/20210518072034.31572-1-rppt@kernel.org
    Link: https://lkml.kernel.org/r/20210518072034.31572-2-rppt@kernel.org
    
    
    Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Acked-by: James Bottomley <James.Bottomley@HansenPartnership.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Christopher Lameter <cl@linux.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Elena Reshetova <elena.reshetova@intel.com>
    Cc: Hagen Paul Pfeifer <hagen@jauu.net>
    Cc: "H. Peter Anvin" <hpa@zytor.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: James Bottomley <jejb@linux.ibm.com>
    Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michael Kerrisk <mtk.manpages@gmail.com>
    Cc: Palmer Dabbelt <palmer@dabbelt.com>
    Cc: Palmer Dabbelt <palmerdabbelt@google.com>
    Cc: Paul Walmsley <paul.walmsley@sifive.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Shuah Khan <shuah@kernel.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Tycho Andersen <tycho@tycho.ws>
    Cc: Will Deacon <will@kernel.org>
    Cc: kernel test robot <lkp@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>