1. 12 Jul, 2019 1 commit
    • Christoph Hellwig's avatar
      mm: consolidate the get_user_pages* implementations · 050a9adc
      Christoph Hellwig authored
      Always build mm/gup.c so that we don't have to provide separate nommu
      stubs.  Also merge the get_user_pages_fast and __get_user_pages_fast stubs
      when HAVE_FAST_GUP into the main implementations, which will never call
      the fast path if HAVE_FAST_GUP is not set.
      This also ensures the new put_user_pages* helpers are available for nommu,
      as those are currently missing, which would create a problem as soon as we
      actually grew users for it.
      Link: http://lkml.kernel.org/r/20190625143715.1689-13-hch@lst.de
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  2. 15 May, 2019 1 commit
    • Dan Williams's avatar
      mm: shuffle initial free memory to improve memory-side-cache utilization · e900a918
      Dan Williams authored
      Patch series "mm: Randomize free memory", v10.
      This patch (of 3):
      Randomization of the page allocator improves the average utilization of
      a direct-mapped memory-side-cache.  Memory side caching is a platform
      capability that Linux has been previously exposed to in HPC
      (high-performance computing) environments on specialty platforms.  In
      that instance it was a smaller pool of high-bandwidth-memory relative to
      higher-capacity / lower-bandwidth DRAM.  Now, this capability is going
      to be found on general purpose server platforms where DRAM is a cache in
      front of higher latency persistent memory [1].
      Robert offered an explanation of the state of the art of Linux
      interactions with memory-side-caches [2], and I copy it here:
          It's been a problem in the HPC space:
          A kernel module called zonesort is available to try to help:
          and this abandoned patch series proposed that for the kernel:
          Dan's patch series doesn't attempt to ensure buffers won't conflict, but
          also reduces the chance that the buffers will. This will make performance
          more consistent, albeit slower than "optimal" (which is near impossible
          to attain in a general-purpose kernel).  That's better than forcing
          users to deploy remedies like:
              "To eliminate this gradual degradation, we have added a Stream
               measurement to the Node Health Check that follows each job;
               nodes are rebooted whenever their measured memory bandwidth
               falls below 300 GB/s."
      A replacement for zonesort was merged upstream in commit cc9aec03
      ("x86/numa_emulation: Introduce uniform split capability").  With this
      numa_emulation capability, memory can be split into cache sized
      ("near-memory" sized) numa nodes.  A bind operation to such a node, and
      disabling workloads on other nodes, enables full cache performance.
      However, once the workload exceeds the cache size then cache conflicts
      are unavoidable.  While HPC environments might be able to tolerate
      time-scheduling of cache sized workloads, for general purpose server
      platforms, the oversubscribed cache case will be the common case.
      The worst case scenario is that a server system owner benchmarks a
      workload at boot with an un-contended cache only to see that performance
      degrade over time, even below the average cache performance due to
      excessive conflicts.  Randomization clips the peaks and fills in the
      valleys of cache utilization to yield steady average performance.
      Here are some performance impact details of the patches:
      1/ An Intel internal synthetic memory bandwidth measurement tool, saw a
         3X speedup in a contrived case that tries to force cache conflicts.
         The contrived cased used the numa_emulation capability to force an
         instance of the benchmark to be run in two of the near-memory sized
         numa nodes.  If both instances were placed on the same emulated they
         would fit and cause zero conflicts.  While on separate emulated nodes
         without randomization they underutilized the cache and conflicted
         unnecessarily due to the in-order allocation per node.
      2/ A well known Java server application benchmark was run with a heap
         size that exceeded cache size by 3X.  The cache conflict rate was 8%
         for the first run and degraded to 21% after page allocator aging.  With
         randomization enabled the rate levelled out at 11%.
      3/ A MongoDB workload did not observe measurable difference in
         cache-conflict rates, but the overall throughput dropped by 7% with
         randomization in one case.
      4/ Mel Gorman ran his suite of performance workloads with randomization
         enabled on platforms without a memory-side-cache and saw a mix of some
         improvements and some losses [3].
      While there is potentially significant improvement for applications that
      depend on low latency access across a wide working-set, the performance
      may be negligible to negative for other workloads.  For this reason the
      shuffle capability defaults to off unless a direct-mapped
      memory-side-cache is detected.  Even then, the page_alloc.shuffle=0
      parameter can be specified to disable the randomization on those systems.
      Outside of memory-side-cache utilization concerns there is potentially
      security benefit from randomization.  Some data exfiltration and
      return-oriented-programming attacks rely on the ability to infer the
      location of sensitive data objects.  The kernel page allocator, especially
      early in system boot, has predictable first-in-first out behavior for
      physical pages.  Pages are freed in physical address order when first
      Quoting Kees:
          "While we already have a base-address randomization
           (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
           memory layouts would certainly be using the predictability of
           allocation ordering (i.e. for attacks where the base address isn't
           important: only the relative positions between allocated memory).
           This is common in lots of heap-style attacks. They try to gain
           control over ordering by spraying allocations, etc.
           I'd really like to see this because it gives us something similar
           to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."
      While SLAB_FREELIST_RANDOM reduces the predictability of some local slab
      caches it leaves vast bulk of memory to be predictably in order allocated.
      However, it should be noted, the concrete security benefits are hard to
      quantify, and no known CVE is mitigated by this randomization.
      Introduce shuffle_free_memory(), and its helper shuffle_zone(), to perform
      a Fisher-Yates shuffle of the page allocator 'free_area' lists when they
      are initially populated with free memory at boot and at hotplug time.  Do
      this based on either the presence of a page_alloc.shuffle=Y command line
      parameter, or autodetection of a memory-side-cache (to be added in a
      follow-on patch).
      The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free
      pages where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1 i.e.  10,
      4MB this trades off randomization granularity for time spent shuffling.
      MAX_ORDER-1 was chosen to be minimally invasive to the page allocator
      while still showing memory-side cache behavior improvements, and the
      expectation that the security implications of finer granularity
      randomization is mitigated by CONFIG_SLAB_FREELIST_RANDOM.  The
      performance impact of the shuffling appears to be in the noise compared to
      other memory initialization work.
      This initial randomization can be undone over time so a follow-on patch is
      introduced to inject entropy on page free decisions.  It is reasonable to
      ask if the page free entropy is sufficient, but it is not enough due to
      the in-order initial freeing of pages.  At the start of that process
      putting page1 in front or behind page0 still keeps them close together,
      page2 is still near page1 and has a high chance of being adjacent.  As
      more pages are added ordering diversity improves, but there is still high
      page locality for the low address pages and this leads to no significant
      impact to the cache conflict rate.
      [1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
      [2]: https://lkml.kernel.org/r/AT5PR8401MB1169D656C8B5E121752FC0F8AB120@AT5PR8401MB1169.NAMPRD84.PROD.OUTLOOK.COM
      [3]: https://lkml.org/lkml/2018/10/12/309
      [dan.j.williams@intel.com: fix shuffle enable]
        Link: http://lkml.kernel.org/r/154943713038.3858443.4125180191382062871.stgit@dwillia2-desk3.amr.corp.intel.com
      [cai@lca.pw: fix SHUFFLE_PAGE_ALLOCATOR help texts]
        Link: http://lkml.kernel.org/r/20190425201300.75650-1-cai@lca.pw
      Link: http://lkml.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Signed-off-by: default avatarQian Cai <cai@lca.pw>
      Reviewed-by: default avatarKees Cook <keescook@chromium.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Robert Elliott <elliott@hpe.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  3. 31 Oct, 2018 3 commits
  4. 07 Sep, 2018 1 commit
  5. 30 Aug, 2018 1 commit
  6. 08 Jun, 2018 1 commit
    • Mike Kravetz's avatar
      mm: restructure memfd code · 5d752600
      Mike Kravetz authored
      With the addition of memfd hugetlbfs support, we now have the situation
      where memfd depends on TMPFS -or- HUGETLBFS.  Previously, memfd was only
      supported on tmpfs, so it made sense that the code resided in shmem.c.
      In the current code, memfd is only functional if TMPFS is defined.  If
      HUGETLFS is defined and TMPFS is not defined, then memfd functionality
      will not be available for hugetlbfs.  This does not cause BUGs, just a
      lack of potentially desired functionality.
      Code is restructured in the following way:
      - include/linux/memfd.h is a new file containing memfd specific
        definitions previously contained in shmem_fs.h.
      - mm/memfd.c is a new file containing memfd specific code previously
        contained in shmem.c.
      - memfd specific code is removed from shmem_fs.h and shmem.c.
      - A new config option MEMFD_CREATE is added that is defined if TMPFS
        or HUGETLBFS is defined.
      No functional changes are made to the code: restructuring only.
      Link: http://lkml.kernel.org/r/20180415182119.4517-4-mike.kravetz@oracle.com
      Signed-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarKhalid Aziz <khalid.aziz@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Herrmann <dh.herrmann@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Marc-Andr Lureau <marcandre.lureau@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  7. 06 Apr, 2018 1 commit
  8. 18 Nov, 2017 1 commit
  9. 16 Nov, 2017 1 commit
  10. 02 Nov, 2017 1 commit
    • Greg Kroah-Hartman's avatar
      License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman authored
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      How this work was done:
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information it it.
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information,
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side by side results from of the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      Criteria used to select files for SPDX license identifier tagging was:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
      All documentation files were explicitly excluded.
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
         For non */uapi/* files that summary was:
         SPDX license identifier                            # files
         GPL-2.0                                              11139
         and resulted in the first patch in this series.
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
         SPDX license identifier                            # files
         GPL-2.0 WITH Linux-syscall-note                        930
         and resulted in the second patch in this series.
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
         SPDX license identifier                            # files
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
         and that resulted in the third patch in this series.
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there was new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      In initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: default avatarKate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: default avatarPhilippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  11. 09 Sep, 2017 2 commits
  12. 20 Jun, 2017 1 commit
  13. 28 Feb, 2017 1 commit
  14. 25 Feb, 2017 1 commit
  15. 23 Feb, 2017 1 commit
    • Tim Chen's avatar
      mm/swap: add cache for swap slots allocation · 67afa38e
      Tim Chen authored
      We add per cpu caches for swap slots that can be allocated and freed
      quickly without the need to touch the swap info lock.
      Two separate caches are maintained for swap slots allocated and swap
      slots returned.  This is to allow the swap slots to be returned to the
      global pool in a batch so they will have a chance to be coaelesced with
      other slots in a cluster.  We do not reuse the slots that are returned
      right away, as it may increase fragmentation of the slots.
      The swap allocation cache is protected by a mutex as we may sleep when
      searching for empty slots in cache.  The swap free cache is protected by
      a spin lock as we cannot sleep in the free path.
      We refill the swap slots cache when we run out of slots, and we disable
      the swap slots cache and drain the slots if the global number of slots
      fall below a low watermark threshold.  We re-enable the cache agian when
      the slots available are above a high watermark.
      [ying.huang@intel.com: use raw_cpu_ptr over this_cpu_ptr for swap slots access]
      [tim.c.chen@linux.intel.com: add comments on locks in swap_slots.h]
        Link: http://lkml.kernel.org/r/20170118180327.GA24225@linux.intel.com
      Link: http://lkml.kernel.org/r/35de301a4eaa8daa2977de6e987f2c154385eb66.1484082593.git.tim.c.chen@linux.intel.com
      Signed-off-by: default avatarTim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net> escreveu:
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  16. 12 Oct, 2016 1 commit
    • Linus Torvalds's avatar
      Disable the __builtin_return_address() warning globally after all · ef6000b4
      Linus Torvalds authored
      This affectively reverts commit 377ccbb4
       ("Makefile: Mute warning
      for __builtin_return_address(>0) for tracing only") because it turns out
      that it really isn't tracing only - it's all over the tree.
      We already also had the warning disabled separately for mm/usercopy.c
      (which this commit also removes), and it turns out that we will also
      want to disable it for get_lock_parent_ip(), that is used for at least
      TRACE_IRQFLAGS.  Which (when enabled) ends up being all over the tree.
      Steven Rostedt had a patch that tried to limit it to just the config
      options that actually triggered this, but quite frankly, the extra
      complexity and abstraction just isn't worth it.  We have never actually
      had a case where the warning is actually useful, so let's just disable
      it globally and not worry about it.
      Acked-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Anvin <hpa@zytor.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  17. 26 Jul, 2016 2 commits
  18. 21 May, 2016 1 commit
  19. 25 Mar, 2016 1 commit
    • Alexander Potapenko's avatar
      mm, kasan: SLAB support · 7ed2f9e6
      Alexander Potapenko authored
      Add KASAN hooks to SLAB allocator.
      This patch is based on the "mm: kasan: unified support for SLUB and SLAB
      allocators" patch originally prepared by Dmitry Chernenkov.
      Signed-off-by: default avatarAlexander Potapenko <glider@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  20. 22 Mar, 2016 1 commit
    • Dmitry Vyukov's avatar
      kernel: add kcov code coverage · 5c9a8750
      Dmitry Vyukov authored
      kcov provides code coverage collection for coverage-guided fuzzing
      (randomized testing).  Coverage-guided fuzzing is a testing technique
      that uses coverage feedback to determine new interesting inputs to a
      system.  A notable user-space example is AFL
      (http://lcamtuf.coredump.cx/afl/).  However, this technique is not
      widely used for kernel testing due to missing compiler and kernel
      kcov does not aim to collect as much coverage as possible.  It aims to
      collect more or less stable coverage that is function of syscall inputs.
      To achieve this goal it does not collect coverage in soft/hard
      interrupts and instrumentation of some inherently non-deterministic or
      non-interesting parts of kernel is disbled (e.g.  scheduler, locking).
      Currently there is a single coverage collection mode (tracing), but the
      API anticipates additional collection modes.  Initially I also
      implemented a second mode which exposes coverage in a fixed-size hash
      table of counters (what Quentin used in his original patch).  I've
      dropped the second mode for simplicity.
      This patch adds the necessary support on kernel side.  The complimentary
      compiler support was added in gcc revision 231296.
      We've used this support to build syzkaller system call fuzzer, which has
      found 90 kernel bugs in just 2 months:
      We've also found 30+ bugs in our internal systems with syzkaller.
      Another (yet unexplored) direction where kcov coverage would greatly
      help is more traditional "blob mutation".  For example, mounting a
      random blob as a filesystem, or receiving a random blob over wire.
      Why not gcov.  Typical fuzzing loop looks as follows: (1) reset
      coverage, (2) execute a bit of code, (3) collect coverage, repeat.  A
      typical coverage can be just a dozen of basic blocks (e.g.  an invalid
      input).  In such context gcov becomes prohibitively expensive as
      reset/collect coverage steps depend on total number of basic
      blocks/edges in program (in case of kernel it is about 2M).  Cost of
      kcov depends only on number of executed basic blocks/edges.  On top of
      that, kernel requires per-thread coverage because there are always
      background threads and unrelated processes that also produce coverage.
      With inlined gcov instrumentation per-thread coverage is not possible.
      kcov exposes kernel PCs and control flow to user-space which is
      insecure.  But debugfs should not be mapped as user accessible.
      Based on a patch by Quentin Casasnovas.
      [akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode']
      [akpm@linux-foundation.org: unbreak allmodconfig]
      [akpm@linux-foundation.org: follow x86 Makefile layout standards]
      Signed-off-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Reviewed-by: default avatarKees Cook <keescook@chromium.org>
      Cc: syzkaller <syzkaller@googlegroups.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Tavis Ormandy <taviso@google.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: David Drysdale <drysdale@google.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  21. 17 Mar, 2016 1 commit
    • Joonsoo Kim's avatar
      mm/page_ref: add tracepoint to track down page reference manipulation · 95813b8f
      Joonsoo Kim authored
      CMA allocation should be guaranteed to succeed by definition, but,
      unfortunately, it would be failed sometimes.  It is hard to track down
      the problem, because it is related to page reference manipulation and we
      don't have any facility to analyze it.
      This patch adds tracepoints to track down page reference manipulation.
      With it, we can find exact reason of failure and can fix the problem.
      Following is an example of tracepoint output.  (note: this example is
      stale version that printing flags as the number.  Recent version will
      print it as human readable string.)
      <...>-9018  [004]    92.678375: page_ref_set:         pfn=0x17ac9 flags=0x0 count=1 mapcount=0 mapping=(nil) mt=4 val=1
      <...>-9018  [004]    92.678378: kernel_stack:
       => get_page_from_freelist (ffffffff81176659)
       => __alloc_pages_nodemask (ffffffff81176d22)
       => alloc_pages_vma (ffffffff811bf675)
       => handle_mm_fault (ffffffff8119e693)
       => __do_page_fault (ffffffff810631ea)
       => trace_do_page_fault (ffffffff81063543)
       => do_async_page_fault (ffffffff8105c40a)
       => async_page_fault (ffffffff817581d8)
      <...>-9018  [004]    92.678379: page_ref_mod:         pfn=0x17ac9 flags=0x40048 count=2 mapcount=1 mapping=0xffff880015a78dc1 mt=4 val=1
      <...>-9131  [001]    93.174468: test_pages_isolated:  start_pfn=0x17800 end_pfn=0x17c00 fin_pfn=0x17ac9 ret=fail
      <...>-9018  [004]    93.174843: page_ref_mod_and_test: pfn=0x17ac9 flags=0x40068 count=0 mapcount=0 mapping=0xffff880015a78dc1 mt=4 val=-1 ret=1
       => release_pages (ffffffff8117c9e4)
       => free_pages_and_swap_cache (ffffffff811b0697)
       => tlb_flush_mmu_free (ffffffff81199616)
       => tlb_finish_mmu (ffffffff8119a62c)
       => exit_mmap (ffffffff811a53f7)
       => mmput (ffffffff81073f47)
       => do_exit (ffffffff810794e9)
       => do_group_exit (ffffffff81079def)
       => SyS_exit_group (ffffffff81079e74)
       => entry_SYSCALL_64_fastpath (ffffffff817560b6)
      This output shows that problem comes from exit path.  In exit path, to
      improve performance, pages are not freed immediately.  They are gathered
      and processed by batch.  During this process, migration cannot be
      possible and CMA allocation is failed.  This problem is hard to find
      without this page reference tracepoint facility.
      Enabling this feature bloat kernel text 30 KB in my configuration.
         text    data     bss     dec     hex filename
      12127327        2243616 1507328 15878271         f2487f vmlinux_disabled
      12157208        2258880 1507328 15923416         f2f8d8 vmlinux_enabled
      Note that, due to header file dependency problem between mm.h and
      tracepoint.h, this feature has to open code the static key functions for
      tracepoints.  Proposed by Steven Rostedt in following link.
      [arnd@arndb.de: crypto/async_pq: use __free_page() instead of put_page()]
      [iamjoonsoo.kim@lge.com: fix build failure for xtensa]
      [akpm@linux-foundation.org: tweak Kconfig text, per Vlastimil]
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: default avatarMichal Nazarewicz <mina86@mina86.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Acked-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  22. 15 Mar, 2016 1 commit
    • Laura Abbott's avatar
      mm/page_poison.c: enable PAGE_POISONING as a separate option · 8823b1db
      Laura Abbott authored
      Page poisoning is currently set up as a feature if architectures don't
      have architecture debug page_alloc to allow unmapping of pages.  It has
      uses apart from that though.  Clearing of the pages on free provides an
      increase in security as it helps to limit the risk of information leaks.
      Allow page poisoning to be enabled as a separate option independent of
      kernel_map pages since the two features do separate work.  Because of
      how hiberanation is implemented, the checks on alloc cannot occur if
      hibernation is enabled.  The runtime alloc checks can also be enabled
      with an option when !HIBERNATION.
      Credit to Grsecurity/PaX team for inspiring this work
      Signed-off-by: default avatarLaura Abbott <labbott@fedoraproject.org>
      Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mathias Krause <minipli@googlemail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jianyu Zhan <nasa4836@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  23. 10 Sep, 2015 1 commit
    • Vladimir Davydov's avatar
      mm: introduce idle page tracking · 33c3fc71
      Vladimir Davydov authored
      Knowing the portion of memory that is not used by a certain application or
      memory cgroup (idle memory) can be useful for partitioning the system
      efficiently, e.g.  by setting memory cgroup limits appropriately.
      Currently, the only means to estimate the amount of idle memory provided
      by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
      access bit for all pages mapped to a particular process by writing 1 to
      clear_refs, wait for some time, and then count smaps:Referenced.  However,
      this method has two serious shortcomings:
       - it does not count unmapped file pages
       - it affects the reclaimer logic
      To overcome these drawbacks, this patch introduces two new page flags,
      Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap.
      A page's Idle flag can only be set from userspace by setting bit in
      /sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page,
      and it is cleared whenever the page is accessed either through page tables
      (it is cleared in page_referenced() in this case) or using the read(2)
      system call (mark_page_accessed()). Thus by setting the Idle flag for
      pages of a particular workload, which can be found e.g.  by reading
      /proc/PID/pagemap, waiting for some time to let the workload access its
      working set, and then reading the bitmap file, one can estimate the amount
      of pages that are not used by the workload.
      The Young page flag is used to avoid interference with the memory
      reclaimer.  A page's Young flag is set whenever the Access bit of a page
      table entry pointing to the page is cleared by writing to the bitmap file.
      If page_referenced() is called on a Young page, it will add 1 to its
      return value, therefore concealing the fact that the Access bit was
      Note, since there is no room for extra page flags on 32 bit, this feature
      uses extended page flags when compiled on 32 bit.
      [akpm@linux-foundation.org: fix build]
      [akpm@linux-foundation.org: kpageidle requires an MMU]
      [akpm@linux-foundation.org: decouple from page-flags rework]
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: default avatarAndres Lagar-Cavilla <andreslc@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  24. 04 Sep, 2015 1 commit
  25. 16 Aug, 2015 1 commit
  26. 14 Apr, 2015 2 commits
    • Vladimir Murzin's avatar
      mm: move memtest under mm · 4a20799d
      Vladimir Murzin authored
      Memtest is a simple feature which fills the memory with a given set of
      patterns and validates memory contents, if bad memory regions is detected
      it reserves them via memblock API.  Since memblock API is widely used by
      other architectures this feature can be enabled outside of x86 world.
      This patch set promotes memtest to live under generic mm umbrella and
      enables memtest feature for arm/arm64.
      It was reported that this patch set was useful for tracking down an issue
      with some errant DMA on an arm64 platform.
      This patch (of 6):
      There is nothing platform dependent in the core memtest code, so other
      platforms might benefit from this feature too.
      [linux@roeck-us.net: MEMTEST depends on MEMBLOCK]
      Signed-off-by: default avatarVladimir Murzin <vladimir.murzin@arm.com>
      Acked-by: default avatarWill Deacon <will.deacon@arm.com>
      Tested-by: default avatarMark Rutland <mark.rutland@arm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Paul Bolle <pebolle@tiscali.nl>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Sasha Levin's avatar
      mm: cma: debugfs interface · 28b24c1f
      Sasha Levin authored
      I've noticed that there is no interfaces exposed by CMA which would let me
      fuzz what's going on in there.
      This small patchset exposes some information out to userspace, plus adds
      the ability to trigger allocation and freeing from userspace.
      This patch (of 3):
      Implement a simple debugfs interface to expose information about CMA areas
      in the system.
      Useful for testing/sanity checks for CMA since it was impossible to
      previously retrieve this information in userspace.
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Acked-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  27. 18 Feb, 2015 1 commit
  28. 17 Feb, 2015 1 commit
    • Matthew Wilcox's avatar
      vfs: remove get_xip_mem · e748dcd0
      Matthew Wilcox authored
      All callers of get_xip_mem() are now gone.  Remove checks for it,
      initialisers of it, documentation of it and the only implementation of it.
       Also remove mm/filemap_xip.c as it is now empty.  Also remove
      documentation of the long-gone get_xip_page().
      Signed-off-by: default avatarMatthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Andreas Dilger <andreas.dilger@intel.com>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  29. 14 Feb, 2015 2 commits
    • Andrey Ryabinin's avatar
      mm: slub: add kernel address sanitizer support for slub allocator · 0316bec2
      Andrey Ryabinin authored
      With this patch kasan will be able to catch bugs in memory allocated by
      slub.  Initially all objects in newly allocated slab page, marked as
      redzone.  Later, when allocation of slub object happens, requested by
      caller number of bytes marked as accessible, and the rest of the object
      (including slub's metadata) marked as redzone (inaccessible).
      We also mark object as accessible if ksize was called for this object.
      There is some places in kernel where ksize function is called to inquire
      size of really allocated area.  Such callers could validly access whole
      allocated memory, so it should be marked as accessible.
      Code in slub.c and slab_common.c files could validly access to object's
      metadata, so instrumentation for this files are disabled.
      Signed-off-by: default avatarAndrey Ryabinin <a.ryabinin@samsung.com>
      Signed-off-by: default avatarDmitry Chernenkov <dmitryc@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Signed-off-by: default avatarAndrey Konovalov <adech.fo@gmail.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Andrey Ryabinin's avatar
      kasan: add kernel address sanitizer infrastructure · 0b24becc
      Andrey Ryabinin authored
      Kernel Address sanitizer (KASan) is a dynamic memory error detector.  It
      provides fast and comprehensive solution for finding use-after-free and
      out-of-bounds bugs.
      KASAN uses compile-time instrumentation for checking every memory access,
      therefore GCC > v4.9.2 required.  v4.9.2 almost works, but has issues with
      putting symbol aliases into the wrong section, which breaks kasan
      instrumentation of globals.
      This patch only adds infrastructure for kernel address sanitizer.  It's
      not available for use yet.  The idea and some code was borrowed from [1].
      Basic idea:
      The main idea of KASAN is to use shadow memory to record whether each byte
      of memory is safe to access or not, and use compiler's instrumentation to
      check the shadow memory on each memory access.
      Address sanitizer uses 1/8 of the memory addressable in kernel for shadow
      memory and uses direct mapping with a scale and offset to translate a
      memory address to its corresponding shadow address.
      Here is function to translate address to corresponding shadow address:
           unsigned long kasan_mem_to_shadow(unsigned long addr)
                      return (addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET;
      So for every 8 bytes there is one corresponding byte of shadow memory.
      The following encoding used for each shadow byte: 0 means that all 8 bytes
      of the corresponding memory region are valid for access; k (1 <= k <= 7)
      means that the first k bytes are valid for access, and other (8 - k) bytes
      are not; Any negative value indicates that the entire 8-bytes are
      inaccessible.  Different negative values used to distinguish between
      different kinds of inaccessible memory (redzones, freed memory) (see
      To be able to detect accesses to bad memory we need a special compiler.
      Such compiler inserts a specific function calls (__asan_load*(addr),
      __asan_store*(addr)) before each memory access of size 1, 2, 4, 8 or 16.
      These functions check whether memory region is valid to access or not by
      checking corresponding shadow memory.  If access is not valid an error
      Historical background of the address sanitizer from Dmitry Vyukov:
      	"We've developed the set of tools, AddressSanitizer (Asan),
      	ThreadSanitizer and MemorySanitizer, for user space. We actively use
      	them for testing inside of Google (continuous testing, fuzzing,
      	running prod services). To date the tools have found more than 10'000
      	scary bugs in Chromium, Google internal codebase and various
      	open-source projects (Firefox, OpenSSL, gcc, clang, ffmpeg, MySQL and
      	lots of others): [2] [3] [4].
      	The tools are part of both gcc and clang compilers.
      	We have not yet done massive testing under the Kernel AddressSanitizer
      	(it's kind of chicken and egg problem, you need it to be upstream to
      	start applying it extensively). To date it has found about 50 bugs.
      	Bugs that we've found in upstream kernel are listed in [5].
      	We've also found ~20 bugs in out internal version of the kernel. Also
      	people from Samsung and Oracle have found some.
      	As others noted, the main feature of AddressSanitizer is its
      	performance due to inline compiler instrumentation and simple linear
      	shadow memory. User-space Asan has ~2x slowdown on computational
      	programs and ~2x memory consumption increase. Taking into account that
      	kernel usually consumes only small fraction of CPU and memory when
      	running real user-space programs, I would expect that kernel Asan will
      	have ~10-30% slowdown and similar memory consumption increase (when we
      	finish all tuning).
      	I agree that Asan can well replace kmemcheck. We have plans to start
      	working on Kernel MemorySanitizer that finds uses of unitialized
      	memory. Asan+Msan will provide feature-parity with kmemcheck. As
      	others noted, Asan will unlikely replace debug slab and pagealloc that
      	can be enabled at runtime. Asan uses compiler instrumentation, so even
      	if it is disabled, it still incurs visible overheads.
      	Asan technology is easily portable to other architectures. Compiler
      	instrumentation is fully portable. Runtime has some arch-dependent
      	parts like shadow mapping and atomic operation interception. They are
      	relatively easy to port."
      Comparison with other debugging features:
        - KASan can do almost everything that kmemcheck can.  KASan uses
          compile-time instrumentation, which makes it significantly faster than
          kmemcheck.  The only advantage of kmemcheck over KASan is detection of
          uninitialized memory reads.
          Some brief performance testing showed that kasan could be
          x500-x600 times faster than kmemcheck:
      $ netperf -l 30
      		MIGRATED TCP STREAM TEST from ( port 0 AF_INET to localhost ( port 0 AF_INET
      		Recv   Send    Send
      		Socket Socket  Message  Elapsed
      		Size   Size    Size     Time     Throughput
      		bytes  bytes   bytes    secs.    10^6bits/sec
      no debug:	87380  16384  16384    30.00    41624.72
      kasan inline:	87380  16384  16384    30.00    12870.54
      kasan outline:	87380  16384  16384    30.00    10586.39
      kmemcheck: 	87380  16384  16384    30.03      20.23
        - Also kmemcheck couldn't work on several CPUs.  It always sets
          number of CPUs to 1.  KASan doesn't have such limitation.
      	- KASan is slower than DEBUG_PAGEALLOC, but KASan works on sub-page
      	  granularity level, so it able to find more bugs.
      SLUB_DEBUG (poisoning, redzones):
      	- SLUB_DEBUG has lower overhead than KASan.
      	- SLUB_DEBUG in most cases are not able to detect bad reads,
      	  KASan able to detect both reads and writes.
      	- In some cases (e.g. redzone overwritten) SLUB_DEBUG detect
      	  bugs only on allocation/freeing of object. KASan catch
      	  bugs right before it will happen, so we always know exact
      	  place of first bad read/write.
      [1] https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel
      [2] https://code.google.com/p/address-sanitizer/wiki/FoundBugs
      [3] https://code.google.com/p/thread-sanitizer/wiki/FoundBugs
      [4] https://code.google.com/p/memory-sanitizer/wiki/FoundBugs
      [5] https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel#Trophies
      Based on work by Andrey Konovalov.
      Signed-off-by: default avatarAndrey Ryabinin <a.ryabinin@samsung.com>
      Acked-by: default avatarMichal Marek <mmarek@suse.cz>
      Signed-off-by: default avatarAndrey Konovalov <adech.fo@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  30. 10 Feb, 2015 1 commit
    • Kirill A. Shutemov's avatar
      mm: replace remap_file_pages() syscall with emulation · c8d78c18
      Kirill A. Shutemov authored
      remap_file_pages(2) was invented to be able efficiently map parts of
      huge file into limited 32-bit virtual address space such as in database
      Nonlinear mappings are pain to support and it seems there's no
      legitimate use-cases nowadays since 64-bit systems are widely available.
      Let's drop it and get rid of all these special-cased code.
      The patch replaces the syscall with emulation which creates new VMA on
      each remap_file_pages(), unless they it can be merged with an adjacent
      I didn't find *any* real code that uses remap_file_pages(2) to test
      emulation impact on.  I've checked Debian code search and source of all
      packages in ALT Linux.  No real users: libc wrappers, mentions in
      strace, gdb, valgrind and this kind of stuff.
      There are few basic tests in LTP for the syscall.  They work just fine
      with emulation.
      To test performance impact, I've written small test case which
      demonstrate pretty much worst case scenario: map 4G shmfs file, write to
      begin of every page pgoff of the page, remap pages in reverse order,
      read every page.
      The test creates 1 million of VMAs if emulation is in use, so I had to
      set vm.max_map_count to 1100000 to avoid -ENOMEM.
      Before:		23.3 ( +-  4.31% ) seconds
      After:		43.9 ( +-  0.85% ) seconds
      Slowdown:	1.88x
      I believe we can live with that.
      Test case:
              #define _GNU_SOURCE
              #include <assert.h>
              #include <stdlib.h>
              #include <stdio.h>
              #include <sys/mman.h>
              #define MB	(1024UL * 1024)
              #define SIZE	(4096 * MB)
              int main(int argc, char **argv)
                      unsigned long *p;
                      long i, pass;
                      for (pass = 0; pass < 10; pass++) {
                              p = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
                                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
                              if (p == MAP_FAILED) {
                                      return -1;
                              for (i = 0; i < SIZE / 4096; i++)
                                      p[i * 4096 / sizeof(*p)] = i;
                              for (i = 0; i < SIZE / 4096; i++) {
                                      if (remap_file_pages(p + i * 4096 / sizeof(*p), 4096,
                                                      0, (SIZE - 4096 * (i + 1)) >> 12, 0)) {
                                              return -1;
                              for (i = SIZE / 4096 - 1; i >= 0; i--)
                                      assert(p[i * 4096 / sizeof(*p)] == SIZE / 4096 - i - 1);
                              munmap(p, SIZE);
                      return 0;
      [akpm@linux-foundation.org: fix spello]
      [sasha.levin@oracle.com: initialize populate before usage]
      [sasha.levin@oracle.com: grab file ref to prevent race while mmaping]
      Signed-off-by: default avatar"Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Armin Rigo <arigo@tunes.org>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  31. 13 Dec, 2014 2 commits
    • Joonsoo Kim's avatar
      mm/page_owner: keep track of page owners · 48c96a36
      Joonsoo Kim authored
      This is the page owner tracking code which is introduced so far ago.  It
      is resident on Andrew's tree, though, nobody tried to upstream so it
      remain as is.  Our company uses this feature actively to debug memory leak
      or to find a memory hogger so I decide to upstream this feature.
      This functionality help us to know who allocates the page.  When
      allocating a page, we store some information about allocation in extra
      memory.  Later, if we need to know status of all pages, we can get and
      analyze it from this stored information.
      In previous version of this feature, extra memory is statically defined in
      struct page, but, in this version, extra memory is allocated outside of
      struct page.  It enables us to turn on/off this feature at boottime
      without considerable memory waste.
      Although we already have tracepoint for tracing page allocation/free,
      using it to analyze page owner is rather complex.  We need to enlarge the
      trace buffer for preventing overlapping until userspace program launched.
      And, launched program continually dump out the trace buffer for later
      analysis and it would change system behaviour with more possibility rather
      than just keeping it in memory, so bad for debug.
      Moreover, we can use page_owner feature further for various purposes.  For
      example, we can use it for fragmentation statistics implemented in this
      patch.  And, I also plan to implement some CMA failure debugging feature
      using this interface.
      I'd like to give the credit for all developers contributed this feature,
      but, it's not easy because I don't know exact history.  Sorry about that.
      Below is people who has "Signed-off-by" in the patches in Andrew's tree.
      Alexander Nyberg <alexn@dsv.su.se>
      Mel Gorman <mgorman@suse.de>
      Dave Hansen <dave@linux.vnet.ibm.com>
      Minchan Kim <minchan@kernel.org>
      Michal Nazarewicz <mina86@mina86.com>
      Andrew Morton <akpm@linux-foundation.org>
      Jungsoo Son <jungsoo.son@lge.com>
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Jungsoo Son <jungsoo.son@lge.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Joonsoo Kim's avatar
      mm/page_ext: resurrect struct page extending code for debugging · eefa864b
      Joonsoo Kim authored
      When we debug something, we'd like to insert some information to every
      page.  For this purpose, we sometimes modify struct page itself.  But,
      this has drawbacks.  First, it requires re-compile.  This makes us
      hesitate to use the powerful debug feature so development process is
      slowed down.  And, second, sometimes it is impossible to rebuild the
      kernel due to third party module dependency.  At third, system behaviour
      would be largely different after re-compile, because it changes size of
      struct page greatly and this structure is accessed by every part of
      kernel.  Keeping this as it is would be better to reproduce errornous
      This feature is intended to overcome above mentioned problems.  This
      feature allocates memory for extended data per page in certain place
      rather than the struct page itself.  This memory can be accessed by the
      accessor functions provided by this code.  During the boot process, it
      checks whether allocation of huge chunk of memory is needed or not.  If
      not, it avoids allocating memory at all.  With this advantage, we can
      include this feature into the kernel in default and can avoid rebuild and
      solve related problems.
      Until now, memcg uses this technique.  But, now, memcg decides to embed
      their variable to struct page itself and it's code to extend struct page
      has been removed.  I'd like to use this code to develop debug feature, so
      this patch resurrect it.
      To help these things to work well, this patch introduces two callbacks for
      clients.  One is the need callback which is mandatory if user wants to
      avoid useless memory allocation at boot-time.  The other is optional, init
      callback, which is used to do proper initialization after memory is
      allocated.  Detailed explanation about purpose of these functions is in
      code comment.  Please refer it.
      Others are completely same with previous extension code in memcg.
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Jungsoo Son <jungsoo.son@lge.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  32. 11 Dec, 2014 2 commits
    • Johannes Weiner's avatar
      mm: page_cgroup: rename file to mm/swap_cgroup.c · 5d1ea48b
      Johannes Weiner authored
      Now that the external page_cgroup data structure and its lookup is gone,
      the only code remaining in there is swap slot accounting.
      Rename it and move the conditional compilation into mm/Makefile.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Acked-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Acked-by: default avatarDavid S. Miller <davem@davemloft.net>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Johannes Weiner's avatar
      mm: memcontrol: lockless page counters · 3e32cb2e
      Johannes Weiner authored
      Memory is internally accounted in bytes, using spinlock-protected 64-bit
      counters, even though the smallest accounting delta is a page.  The
      counter interface is also convoluted and does too many things.
      Introduce a new lockless word-sized page counter API, then change all
      memory accounting over to it.  The translation from and to bytes then only
      happens when interfacing with userspace.
      The removed locking overhead is noticable when scaling beyond the per-cpu
      charge caches - on a 4-socket machine with 144-threads, the following test
      shows the performance differences of 288 memcgs concurrently running a
      page fault benchmark:
         18631648.500498      task-clock (msec)         #  140.643 CPUs utilized            ( +-  0.33% )
               1,380,638      context-switches          #    0.074 K/sec                    ( +-  0.75% )
                  24,390      cpu-migrations            #    0.001 K/sec                    ( +-  8.44% )
           1,843,305,768      page-faults               #    0.099 M/sec                    ( +-  0.00% )
      50,134,994,088,218      cycles                    #    2.691 GHz                      ( +-  0.33% )
         <not supported>      stalled-cycles-frontend
         <not supported>      stalled-cycles-backend
       8,049,712,224,651      instructions              #    0.16  insns per cycle          ( +-  0.04% )
       1,586,970,584,979      branches                  #   85.176 M/sec                    ( +-  0.05% )
           1,724,989,949      branch-misses             #    0.11% of all branches          ( +-  0.48% )
           132.474343877 seconds time elapsed                                          ( +-  0.21% )
         12195979.037525      task-clock (msec)         #  133.480 CPUs utilized            ( +-  0.18% )
                 832,850      context-switches          #    0.068 K/sec                    ( +-  0.54% )
                  15,624      cpu-migrations            #    0.001 K/sec                    ( +- 10.17% )
           1,843,304,774      page-faults               #    0.151 M/sec                    ( +-  0.00% )
      32,811,216,801,141      cycles                    #    2.690 GHz                      ( +-  0.18% )
         <not supported>      stalled-cycles-frontend
         <not supported>      stalled-cycles-backend
       9,999,265,091,727      instructions              #    0.30  insns per cycle          ( +-  0.10% )
       2,076,759,325,203      branches                  #  170.282 M/sec                    ( +-  0.12% )
           1,656,917,214      branch-misses             #    0.08% of all branches          ( +-  0.55% )
            91.369330729 seconds time elapsed                                          ( +-  0.45% )
      On top of improved scalability, this also gets rid of the icky long long
      types in the very heart of memcg, which is great for 32 bit and also makes
      the code a lot more readable.
      Notable differences between the old and new API:
      - res_counter_charge() and res_counter_charge_nofail() become
        page_counter_try_charge() and page_counter_charge() resp. to match
        the more common kernel naming scheme of try_do()/do()
      - res_counter_uncharge_until() is only ever used to cancel a local
        counter and never to uncharge bigger segments of a hierarchy, so
        it's replaced by the simpler page_counter_cancel()
      - res_counter_set_limit() is replaced by page_counter_limit(), which
        expects its callers to serialize against themselves
      - res_counter_memparse_write_strategy() is replaced by
        page_counter_limit(), which rounds down to the nearest page size -
        rather than up.  This is more reasonable for explicitely requested
        hard upper limits.
      - to keep charging light-weight, page_counter_try_charge() charges
        speculatively, only to roll back if the result exceeds the limit.
        Because of this, a failing bigger charge can temporarily lock out
        smaller charges that would otherwise succeed.  The error is bounded
        to the difference between the smallest and the biggest possible
        charge size, so for memcg, this means that a failing THP charge can
        send base page charges into reclaim upto 2MB (4MB) before the limit
        would have been reached.  This should be acceptable.
      [akpm@linux-foundation.org: add includes for WARN_ON_ONCE and memparse]
      [akpm@linux-foundation.org: add includes for WARN_ON_ONCE, memparse, strncmp, and PAGE_SIZE]
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Acked-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>