1. 16 Jan, 2016 8 commits
    • Dan Williams's avatar
      mm: skip memory block registration for ZONE_DEVICE · 260ae3f7
      Dan Williams authored
      
      
      Prevent userspace from trying and failing to online ZONE_DEVICE pages
      which are meant to never be onlined.
      
      For example on platforms with a udev rule like the following:
      
        SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"
      
      ...will generate futile attempts to online the ZONE_DEVICE sections.
      Example kernel messages:
      
          Built 1 zonelists in Node order, mobility grouping on.  Total pages: 1004747
          Policy zone: Normal
          online_pages [mem 0x248000000-0x24fffffff] failed
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      260ae3f7
    • Kirill A. Shutemov's avatar
      mm: prepare page_referenced() and page_idle to new THP refcounting · b20ce5e0
      Kirill A. Shutemov authored
      
      
      Both page_referenced() and page_idle_clear_pte_refs_one() assume that
      THP can only be mapped with PMD, so there's no reason to look on PTEs
      for PageTransHuge() pages.  That's no true anymore: THP can be mapped
      with PTEs too.
      
      The patch removes PageTransHuge() test from the functions and opencode
      page table check.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b20ce5e0
    • Kirill A. Shutemov's avatar
      thp: introduce deferred_split_huge_page() · 9a982250
      Kirill A. Shutemov authored
      
      
      Currently we don't split huge page on partial unmap.  It's not an ideal
      situation.  It can lead to memory overhead.
      
      Furtunately, we can detect partial unmap on page_remove_rmap().  But we
      cannot call split_huge_page() from there due to locking context.
      
      It's also counterproductive to do directly from munmap() codepath: in
      many cases we will hit this from exit(2) and splitting the huge page
      just to free it up in small pages is not what we really want.
      
      The patch introduce deferred_split_huge_page() which put the huge page
      into queue for splitting.  The splitting itself will happen when we get
      memory pressure via shrinker interface.  The page will be dropped from
      list on freeing through compound page destructor.
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Tested-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarJerome Marchand <jmarchan@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9a982250
    • Naoya Horiguchi's avatar
      mm: hwpoison: adjust for new thp refcounting · 4e41a30c
      Naoya Horiguchi authored
      
      
      Some mm-related BUG_ON()s could trigger from hwpoison code due to recent
      changes in thp refcounting rule.  This patch fixes them up.
      
      In the new refcounting, we no longer use tail->_mapcount to keep tail's
      refcount, and thereby we can simplify get/put_hwpoison_page().
      
      And another change is that tail's refcount is not transferred to the raw
      page during thp split (more precisely, in new rule we don't take
      refcount on tail page any more.) So when we need thp split, we have to
      transfer the refcount properly to the 4kB soft-offlined page before
      migration.
      
      thp split code goes into core code only when precheck
      (total_mapcount(head) == page_count(head) - 1) passes to avoid useless
      split, where we assume that one refcount is held by the caller of thp
      split and the others are taken via mapping.  To meet this assumption,
      this patch moves thp split part in soft_offline_page() after
      get_any_page().
      
      [akpm@linux-foundation.org: remove unneeded #define, per Kirill]
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: default avatarKirill A.  Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4e41a30c
    • Kirill A. Shutemov's avatar
      mm: differentiate page_mapped() from page_mapcount() for compound pages · e1534ae9
      Kirill A. Shutemov authored
      
      
      Let's define page_mapped() to be true for compound pages if any
      sub-pages of the compound page is mapped (with PMD or PTE).
      
      On other hand page_mapcount() return mapcount for this particular small
      page.
      
      This will make cases like page_get_anon_vma() behave correctly once we
      allow huge pages to be mapped with PTE.
      
      Most users outside core-mm should use page_mapcount() instead of
      page_mapped().
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Tested-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: default avatarJerome Marchand <jmarchan@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e1534ae9
    • Kirill A. Shutemov's avatar
      mm: rework mapcount accounting to enable 4k mapping of THPs · 53f9263b
      Kirill A. Shutemov authored
      
      
      We're going to allow mapping of individual 4k pages of THP compound.  It
      means we need to track mapcount on per small page basis.
      
      Straight-forward approach is to use ->_mapcount in all subpages to track
      how many time this subpage is mapped with PMDs or PTEs combined.  But
      this is rather expensive: mapping or unmapping of a THP page with PMD
      would require HPAGE_PMD_NR atomic operations instead of single we have
      now.
      
      The idea is to store separately how many times the page was mapped as
      whole -- compound_mapcount.  This frees up ->_mapcount in subpages to
      track PTE mapcount.
      
      We use the same approach as with compound page destructor and compound
      order to store compound_mapcount: use space in first tail page,
      ->mapping this time.
      
      Any time we map/unmap whole compound page (THP or hugetlb) -- we
      increment/decrement compound_mapcount.  When we map part of compound
      page with PTE we operate on ->_mapcount of the subpage.
      
      page_mapcount() counts both: PTE and PMD mappings of the page.
      
      Basically, we have mapcount for a subpage spread over two counters.  It
      makes tricky to detect when last mapcount for a page goes away.
      
      We introduced PageDoubleMap() for this.  When we split THP PMD for the
      first time and there's other PMD mapping left we offset up ->_mapcount
      in all subpages by one and set PG_double_map on the compound page.
      These additional references go away with last compound_mapcount.
      
      This approach provides a way to detect when last mapcount goes away on
      per small page basis without introducing new overhead for most common
      cases.
      
      [akpm@linux-foundation.org: fix typo in comment]
      [mhocko@suse.com: ignore partial THP when moving task]
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: default avatarJerome Marchand <jmarchan@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      53f9263b
    • Kirill A. Shutemov's avatar
      mm, thp: remove compound_lock() · 3ac808fd
      Kirill A. Shutemov authored
      
      
      We are going to use migration entries to stabilize page counts.  It
      means we don't need compound_lock() for that.
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Tested-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarJerome Marchand <jmarchan@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3ac808fd
    • Kirill A. Shutemov's avatar
      mm: drop tail page refcounting · ddc58f27
      Kirill A. Shutemov authored
      
      
      Tail page refcounting is utterly complicated and painful to support.
      
      It uses ->_mapcount on tail pages to store how many times this page is
      pinned.  get_page() bumps ->_mapcount on tail page in addition to
      ->_count on head.  This information is required by split_huge_page() to
      be able to distribute pins from head of compound page to tails during
      the split.
      
      We will need ->_mapcount to account PTE mappings of subpages of the
      compound page.  We eliminate need in current meaning of ->_mapcount in
      tail pages by forbidding split entirely if the page is pinned.
      
      The only user of tail page refcounting is THP which is marked BROKEN for
      now.
      
      Let's drop all this mess.  It makes get_page() and put_page() much
      simpler.
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Tested-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarJerome Marchand <jmarchan@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ddc58f27
  2. 15 Jan, 2016 4 commits
    • Konstantin Khlebnikov's avatar
      mm: rework virtual memory accounting · 84638335
      Konstantin Khlebnikov authored
      
      
      When inspecting a vague code inside prctl(PR_SET_MM_MEM) call (which
      testing the RLIMIT_DATA value to figure out if we're allowed to assign
      new @start_brk, @brk, @start_data, @end_data from mm_struct) it's been
      commited that RLIMIT_DATA in a form it's implemented now doesn't do
      anything useful because most of user-space libraries use mmap() syscall
      for dynamic memory allocations.
      
      Linus suggested to convert RLIMIT_DATA rlimit into something suitable
      for anonymous memory accounting.  But in this patch we go further, and
      the changes are bundled together as:
      
       * keep vma counting if CONFIG_PROC_FS=n, will be used for limits
       * replace mm->shared_vm with better defined mm->data_vm
       * account anonymous executable areas as executable
       * account file-backed growsdown/up areas as stack
       * drop struct file* argument from vm_stat_account
       * enforce RLIMIT_DATA for size of data areas
      
      This way code looks cleaner: now code/stack/data classification depends
      only on vm_flags state:
      
       VM_EXEC & ~VM_WRITE            -> code  (VmExe + VmLib in proc)
       VM_GROWSUP | VM_GROWSDOWN      -> stack (VmStk)
       VM_WRITE & ~VM_SHARED & !stack -> data  (VmData)
      
      The rest (VmSize - VmData - VmStk - VmExe - VmLib) could be called
      "shared", but that might be strange beast like readonly-private or VM_IO
      area.
      
       - RLIMIT_AS            limits whole address space "VmSize"
       - RLIMIT_STACK         limits stack "VmStk" (but each vma individually)
       - RLIMIT_DATA          now limits "VmData"
      Signed-off-by: default avatarKonstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: default avatarCyrill Gorcunov <gorcunov@openvz.org>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Kees Cook <keescook@google.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      84638335
    • Michal Hocko's avatar
      mm: allow GFP_{FS,IO} for page_cache_read page cache allocation · c20cd45e
      Michal Hocko authored
      
      
      page_cache_read has been historically using page_cache_alloc_cold to
      allocate a new page.  This means that mapping_gfp_mask is used as the
      base for the gfp_mask.  Many filesystems are setting this mask to
      GFP_NOFS to prevent from fs recursion issues.  page_cache_read is called
      from the vm_operations_struct::fault() context during the page fault.
      This context doesn't need the reclaim protection normally.
      
      ceph and ocfs2 which call filemap_fault from their fault handlers seem
      to be OK because they are not taking any fs lock before invoking generic
      implementation.  xfs which takes XFS_MMAPLOCK_SHARED is safe from the
      reclaim recursion POV because this lock serializes truncate and punch
      hole with the page faults and it doesn't get involved in the reclaim.
      
      There is simply no reason to deliberately use a weaker allocation
      context when a __GFP_FS | __GFP_IO can be used.  The GFP_NOFS protection
      might be even harmful.  There is a push to fail GFP_NOFS allocations
      rather than loop within allocator indefinitely with a very limited
      reclaim ability.  Once we start failing those requests the OOM killer
      might be triggered prematurely because the page cache allocation failure
      is propagated up the page fault path and end up in
      pagefault_out_of_memory.
      
      We cannot play with mapping_gfp_mask directly because that would be racy
      wrt.  parallel page faults and it might interfere with other users who
      really rely on NOFS semantic from the stored gfp_mask.  The mask is also
      inode proper so it would even be a layering violation.  What we can do
      instead is to push the gfp_mask into struct vm_fault and allow fs layer
      to overwrite it should the callback need to be called with a different
      allocation context.
      
      Initialize the default to (mapping_gfp_mask | __GFP_FS | __GFP_IO)
      because this should be safe from the page fault path normally.  Why do
      we care about mapping_gfp_mask at all then? Because this doesn't hold
      only reclaim protection flags but it also might contain zone and
      movability restrictions (GFP_DMA32, __GFP_MOVABLE and others) so we have
      to respect those.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reported-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: default avatarJan Kara <jack@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c20cd45e
    • Daniel Cashman's avatar
      mm: mmap: add new /proc tunable for mmap_base ASLR · d07e2259
      Daniel Cashman authored
      Address Space Layout Randomization (ASLR) provides a barrier to
      exploitation of user-space processes in the presence of security
      vulnerabilities by making it more difficult to find desired code/data
      which could help an attack.  This is done by adding a random offset to
      the location of regions in the process address space, with a greater
      range of potential offset values corresponding to better protection/a
      larger search-space for brute force, but also to greater potential for
      fragmentation.
      
      The offset added to the mmap_base address, which provides the basis for
      the majority of the mappings for a process, is set once on process exec
      in arch_pick_mmap_layout() and is done via hard-coded per-arch values,
      which reflect, hopefully, the best compromise for all systems.  The
      trade-off between increased entropy in the offset value generation and
      the corresponding increased variability in address space fragmentation
      is not absolute, however, and some platforms may tolerate higher amounts
      of entropy.  This patch introduces both new Kconfig values and a sysctl
      interface which may be used to change the amount of entropy used for
      offset generation on a system.
      
      The direct motivation for this change was in response to the
      libstagefright vulnerabilities that affected Android, specifically to
      information provided by Google's project zero at:
      
        http://googleprojectzero.blogspot.com/2015/09/stagefrightened.html
      
      
      
      The attack presented therein, by Google's project zero, specifically
      targeted the limited randomness used to generate the offset added to the
      mmap_base address in order to craft a brute-force-based attack.
      Concretely, the attack was against the mediaserver process, which was
      limited to respawning every 5 seconds, on an arm device.  The hard-coded
      8 bits used resulted in an average expected success rate of defeating
      the mmap ASLR after just over 10 minutes (128 tries at 5 seconds a
      piece).  With this patch, and an accompanying increase in the entropy
      value to 16 bits, the same attack would take an average expected time of
      over 45 hours (32768 tries), which makes it both less feasible and more
      likely to be noticed.
      
      The introduced Kconfig and sysctl options are limited by per-arch
      minimum and maximum values, the minimum of which was chosen to match the
      current hard-coded value and the maximum of which was chosen so as to
      give the greatest flexibility without generating an invalid mmap_base
      address, generally a 3-4 bits less than the number of bits in the
      user-space accessible virtual address space.
      
      When decided whether or not to change the default value, a system
      developer should consider that mmap_base address could be placed
      anywhere up to 2^(value) bits away from the non-randomized location,
      which would introduce variable-sized areas above and below the mmap_base
      address such that the maximum vm_area_struct size may be reduced,
      preventing very large allocations.
      
      This patch (of 4):
      
      ASLR only uses as few as 8 bits to generate the random offset for the
      mmap base address on 32 bit architectures.  This value was chosen to
      prevent a poorly chosen value from dividing the address space in such a
      way as to prevent large allocations.  This may not be an issue on all
      platforms.  Allow the specification of a minimum number of bits so that
      platforms desiring greater ASLR protection may determine where to place
      the trade-off.
      Signed-off-by: default avatarDaniel Cashman <dcashman@google.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Acked-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mark Salyzyn <salyzyn@android.com>
      Cc: Jeff Vander Stoep <jeffv@google.com>
      Cc: Nick Kralevich <nnk@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Hector Marco-Gisbert <hecmargi@upv.es>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d07e2259
    • Jerome Marchand's avatar
      mm, shmem: add internal shmem resident memory accounting · eca56ff9
      Jerome Marchand authored
      
      
      Currently looking at /proc/<pid>/status or statm, there is no way to
      distinguish shmem pages from pages mapped to a regular file (shmem pages
      are mapped to /dev/zero), even though their implication in actual memory
      use is quite different.
      
      The internal accounting currently counts shmem pages together with
      regular files.  As a preparation to extend the userspace interfaces,
      this patch adds MM_SHMEMPAGES counter to mm_rss_stat to account for
      shmem pages separately from MM_FILEPAGES.  The next patch will expose it
      to userspace - this patch doesn't change the exported values yet, by
      adding up MM_SHMEMPAGES to MM_FILEPAGES at places where MM_FILEPAGES was
      used before.  The only user-visible change after this patch is the OOM
      killer message that separates the reported "shmem-rss" from "file-rss".
      
      [vbabka@suse.cz: forward-porting, tweak changelog]
      Signed-off-by: default avatarJerome Marchand <jmarchan@redhat.com>
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      eca56ff9
  3. 07 Nov, 2015 3 commits
    • Kirill A. Shutemov's avatar
      mm: use 'unsigned int' for page order · d00181b9
      Kirill A. Shutemov authored
      
      
      Let's try to be consistent about data type of page order.
      
      [sfr@canb.auug.org.au: fix build (type of pageblock_order)]
      [hughd@google.com: some configs end up with MAX_ORDER and pageblock_order having different types]
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: default avatarStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d00181b9
    • Kirill A. Shutemov's avatar
      mm: make compound_head() robust · 1d798ca3
      Kirill A. Shutemov authored
      Hugh has pointed that compound_head() call can be unsafe in some
      context. There's one example:
      
      	CPU0					CPU1
      
      isolate_migratepages_block()
        page_count()
          compound_head()
            !!PageTail() == true
      					put_page()
      					  tail->first_page = NULL
            head = tail->first_page
      					alloc_pages(__GFP_COMP)
      					   prep_compound_page()
      					     tail->first_page = head
      					     __SetPageTail(p);
            !!PageTail() == true
          <head == NULL dereferencing>
      
      The race is pure theoretical. I don't it's possible to trigger it in
      practice. But who knows.
      
      We can fix the race by changing how encode PageTail() and compound_head()
      within struct page to be able to update them in one shot.
      
      The patch introduces page->compound_head into third double word block in
      front of compound_dtor and compound_order. Bit 0 encodes PageTail() and
      the rest bits are pointer to head page if bit zero is set.
      
      The patch moves page->pmd_huge_pte out of word, just in case if an
      architecture defines pgtable_t into something what can have the bit 0
      set.
      
      hugetlb_cgroup uses page->lru.next in the second tail page to store
      pointer struct hugetlb_cgroup. The patch switch it to use page->private
      in the second tail page instead. The space is free since ->first_page is
      removed from the union.
      
      The patch also opens possibility to remove HUGETLB_CGROUP_MIN_ORDER
      limitation, since there's now space in first tail page to store struct
      hugetlb_cgroup pointer. But that's out of scope of the patch.
      
      That means page->compound_head shares storage space with:
      
       - page->lru.next;
       - page->next;
       - page->rcu_head.next;
      
      That's too long list to be absolutely sure, but looks like nobody uses
      bit 0 of the word.
      
      page->rcu_head.next guaranteed[1] to have bit 0 clean as long as we use
      call_rcu(), call_rcu_bh(), call_rcu_sched(), or call_srcu(). But future
      call_rcu_lazy() is not allowed as it makes use of the bit and we can
      get false positive PageTail().
      
      [1] http://lkml.kernel.org/g/20150827163634.GD4029@linux.vnet.ibm.com
      
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1d798ca3
    • Kirill A. Shutemov's avatar
      mm: pack compound_dtor and compound_order into one word in struct page · f1e61557
      Kirill A. Shutemov authored
      
      
      The patch halves space occupied by compound_dtor and compound_order in
      struct page.
      
      For compound_order, it's trivial long -> short conversion.
      
      For get_compound_page_dtor(), we now use hardcoded table for destructor
      lookup and store its index in the struct page instead of direct pointer
      to destructor. It shouldn't be a big trouble to maintain the table: we
      have only two destructor and NULL currently.
      
      This patch free up one word in tail pages for reuse. This is preparation
      for the next patch.
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f1e61557
  4. 06 Nov, 2015 3 commits
    • Eric B Munson's avatar
      mm: introduce VM_LOCKONFAULT · de60f5f1
      Eric B Munson authored
      
      
      The cost of faulting in all memory to be locked can be very high when
      working with large mappings.  If only portions of the mapping will be used
      this can incur a high penalty for locking.
      
      For the example of a large file, this is the usage pattern for a large
      statical language model (probably applies to other statical or graphical
      models as well).  For the security example, any application transacting in
      data that cannot be swapped out (credit card data, medical records, etc).
      
      This patch introduces the ability to request that pages are not
      pre-faulted, but are placed on the unevictable LRU when they are finally
      faulted in.  The VM_LOCKONFAULT flag will be used together with VM_LOCKED
      and has no effect when set without VM_LOCKED.  Setting the VM_LOCKONFAULT
      flag for a VMA will cause pages faulted into that VMA to be added to the
      unevictable LRU when they are faulted or if they are already present, but
      will not cause any missing pages to be faulted in.
      
      Exposing this new lock state means that we cannot overload the meaning of
      the FOLL_POPULATE flag any longer.  Prior to this patch it was used to
      mean that the VMA for a fault was locked.  This means we need the new
      FOLL_MLOCK flag to communicate the locked state of a VMA.  FOLL_POPULATE
      will now only control if the VMA should be populated and in the case of
      VM_LOCKONFAULT, it will not be set.
      Signed-off-by: default avatarEric B Munson <emunson@akamai.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      de60f5f1
    • Vladimir Davydov's avatar
      mm: do not inc NR_PAGETABLE if ptlock_init failed · 706874e9
      Vladimir Davydov authored
      
      
      If ALLOC_SPLIT_PTLOCKS is defined, ptlock_init may fail, in which case we
      shouldn't increment NR_PAGETABLE.
      
      Since small allocations, such as ptlock, normally do not fail (currently
      they can fail if kmemcg is used though), this patch does not really fix
      anything and should be considered as a code cleanup.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      706874e9
    • Roman Gushchin's avatar
      mm: use only per-device readahead limit · 600e19af
      Roman Gushchin authored
      Maximal readahead size is limited now by two values:
       1) by global 2Mb constant (MAX_READAHEAD in max_sane_readahead())
       2) by configurable per-device value* (bdi->ra_pages)
      
      There are devices, which require custom readahead limit.
      For instance, for RAIDs it's calculated as number of devices
      multiplied by chunk size times 2.
      
      Readahead size can never be larger than bdi->ra_pages * 2 value
      (POSIX_FADV_SEQUNTIAL doubles readahead size).
      
      If so, why do we need two limits?
      I suggest to completely remove this max_sane_readahead() stuff and
      use per-device readahead limit everywhere.
      
      Also, using right readahead size for RAID disks can significantly
      increase i/o performance:
      
      before:
        dd if=/dev/md2 of=/dev/null bs=100M count=100
        100+0 records in
        100+0 records out
        10485760000 bytes (10 GB) copied, 12.9741 s, 808 MB/s
      
      after:
        $ dd if=/dev/md2 of=/dev/null bs=100M count=100
        100+0 records in
        100+0 records out
        10485760000 bytes (10 GB) copied, 8.91317 s, 1.2 GB/s
      
      (It's an 8-disks RAID5 storage).
      
      This patch doesn't change sys_readahead and madvise(MADV_WILLNEED)
      behavior introduced by 6d2be915
      
       ("mm/readahead.c: fix readahead
      failure for memoryless NUMA nodes and limit readahead pages").
      Signed-off-by: default avatarRoman Gushchin <klamm@yandex-team.ru>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: onstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      600e19af
  5. 02 Oct, 2015 1 commit
    • Greg Thelen's avatar
      memcg: fix dirty page migration · 0610c25d
      Greg Thelen authored
      The problem starts with a file backed dirty page which is charged to a
      memcg.  Then page migration is used to move oldpage to newpage.
      
      Migration:
       - copies the oldpage's data to newpage
       - clears oldpage.PG_dirty
       - sets newpage.PG_dirty
       - uncharges oldpage from memcg
       - charges newpage to memcg
      
      Clearing oldpage.PG_dirty decrements the charged memcg's dirty page
      count.
      
      However, because newpage is not yet charged, setting newpage.PG_dirty
      does not increment the memcg's dirty page count.  After migration
      completes newpage.PG_dirty is eventually cleared, often in
      account_page_cleaned().  At this time newpage is charged to a memcg so
      the memcg's dirty page count is decremented which causes underflow
      because the count was not previously incremented by migration.  This
      underflow causes balance_dirty_pages() to see a very large unsigned
      number of dirty memcg pages which leads to aggressive throttling of
      buffered writes by processes in non root memcg.
      
      This issue:
       - can harm performance of non root memcg buffered writes.
       - can report too small (even negative) values in
         memory.stat[(total_)dirty] counters of all memcg, including the root.
      
      To avoid polluting migrate.c with #ifdef CONFIG_MEMCG checks, introduce
      page_memcg() and set_page_memcg() helpers.
      
      Test:
          0) setup and enter limited memcg
          mkdir /sys/fs/cgroup/test
          echo 1G > /sys/fs/cgroup/test/memory.limit_in_bytes
          echo $$ > /sys/fs/cgroup/test/cgroup.procs
      
          1) buffered writes baseline
          dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
          sync
          grep ^dirty /sys/fs/cgroup/test/memory.stat
      
          2) buffered writes with compaction antagonist to induce migration
          yes 1 > /proc/sys/vm/compact_memory &
          rm -rf /data/tmp/foo
          dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
          kill %
          sync
          grep ^dirty /sys/fs/cgroup/test/memory.stat
      
          3) buffered writes without antagonist, should match baseline
          rm -rf /data/tmp/foo
          dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
          sync
          grep ^dirty /sys/fs/cgroup/test/memory.stat
      
                             (speed, dirty residue)
                   unpatched                       patched
          1) 841 MB/s 0 dirty pages          886 MB/s 0 dirty pages
          2) 611 MB/s -33427456 dirty pages  793 MB/s 0 dirty pages
          3) 114 MB/s -33427456 dirty pages  891 MB/s 0 dirty pages
      
          Notice that unpatched baseline performance (1) fell after
          migration (3): 841 -> 114 MB/s.  In the patched kernel, post
          migration performance matches baseline.
      
      Fixes: c4843a75
      
       ("memcg: add per cgroup dirty page accounting")
      Signed-off-by: default avatarGreg Thelen <gthelen@google.com>
      Reported-by: default avatarDave Hansen <dave.hansen@intel.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>	[4.2+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0610c25d
  6. 10 Sep, 2015 1 commit
  7. 08 Sep, 2015 5 commits
  8. 04 Sep, 2015 3 commits
  9. 21 Aug, 2015 1 commit
    • Michal Hocko's avatar
      mm: make page pfmemalloc check more robust · 2f064f34
      Michal Hocko authored
      Commit c48a11c7 ("netvm: propagate page->pfmemalloc to skb") added
      checks for page->pfmemalloc to __skb_fill_page_desc():
      
              if (page->pfmemalloc && !page->mapping)
                      skb->pfmemalloc = true;
      
      It assumes page->mapping == NULL implies that page->pfmemalloc can be
      trusted.  However, __delete_from_page_cache() can set set page->mapping
      to NULL and leave page->index value alone.  Due to being in union, a
      non-zero page->index will be interpreted as true page->pfmemalloc.
      
      So the assumption is invalid if the networking code can see such a page.
      And it seems it can.  We have encountered this with a NFS over loopback
      setup when such a page is attached to a new skbuf.  There is no copying
      going on in this case so the page confuses __skb_fill_page_desc which
      interprets the index as pfmemalloc flag and the network stack drops
      packets that have been allocated using the reserves unless they are to
      be queued on sockets handling the swapping which is the ...
      2f064f34
  10. 16 Aug, 2015 1 commit
  11. 11 Aug, 2015 1 commit
    • Dan Williams's avatar
      mm: enhance region_is_ram() to region_intersects() · 124fe20d
      Dan Williams authored
      
      
      region_is_ram() is used to prevent the establishment of aliased mappings
      to physical "System RAM" with incompatible cache settings.  However, it
      uses "-1" to indicate both "unknown" memory ranges (ranges not described
      by platform firmware) and "mixed" ranges (where the parameters describe
      a range that partially overlaps "System RAM").
      
      Fix this up by explicitly tracking the "unknown" vs "mixed" resource
      cases and returning REGION_INTERSECTS, REGION_MIXED, or REGION_DISJOINT.
      This re-write also adds support for detecting when the requested region
      completely eclipses all of a resource.  Note, the implementation treats
      overlaps between "unknown" and the requested memory type as
      REGION_INTERSECTS.
      
      Finally, other memory types can be passed in by name, for now the only
      usage "System RAM".
      Suggested-by: default avatarLuis R. Rodriguez <mcgrof@suse.com>
      Reviewed-by: default avatarToshi Kani <toshi.kani@hp.com>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      124fe20d
  12. 01 Jul, 2015 2 commits
  13. 25 Jun, 2015 4 commits
    • Xie XiuQi's avatar
      memory-failure: change type of action_result's param 3 to enum · cc3e2af4
      Xie XiuQi authored
      
      
      Change type of action_result's param 3 to enum for type consistency,
      and rename mf_outcome to mf_result for clearly.
      Signed-off-by: default avatarXie XiuQi <xiexiuqi@huawei.com>
      Acked-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Chen Gong <gong.chen@linux.intel.com>
      Cc: Jim Davis <jim.epost@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cc3e2af4
    • Xie XiuQi's avatar
      memory-failure: export page_type and action result · cc637b17
      Xie XiuQi authored
      
      
      Export 'outcome' and 'action_page_type' to mm.h, so we could use
      this emnus outside.
      
      This patch is preparation for adding trace events for memory-failure
      recovery action.
      Signed-off-by: default avatarXie XiuQi <xiexiuqi@huawei.com>
      Acked-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Chen Gong <gong.chen@linux.intel.com>
      Cc: Jim Davis <jim.epost@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cc637b17
    • Naoya Horiguchi's avatar
      mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling · ead07f6a
      Naoya Horiguchi authored
      
      
      memory_failure() can run in 2 different mode (specified by
      MF_COUNT_INCREASED) in page refcount perspective.  When
      MF_COUNT_INCREASED is set, memory_failure() assumes that the caller
      takes a refcount of the target page.  And if cleared, memory_failure()
      takes it in it's own.
      
      In current code, however, refcounting is done differently in each caller.
      For example, madvise_hwpoison() uses get_user_pages_fast() and
      hwpoison_inject() uses get_page_unless_zero().  So this inconsistent
      refcounting causes refcount failure especially for thp tail pages.
      Typical user visible effects are like memory leak or
      VM_BUG_ON_PAGE(!page_count(page)) in isolate_lru_page().
      
      To fix this refcounting issue, this patch introduces get_hwpoison_page()
      to handle thp tail pages in the same manner for each caller of hwpoison
      code.
      
      memory_failure() might fail to split thp and in such case it returns
      without completing page isolation.  This is not good because PageHWPoison
      on the thp is still set and there's no easy way to unpoison such thps.  So
      this patch try to roll back any action to the thp in "non anonymous thp"
      case and "thp split failed" case, expecting an MCE(SRAR) generated by
      later access afterward will properly free such thps.
      
      [akpm@linux-foundation.org: fix CONFIG_HWPOISON_INJECT=m]
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ead07f6a
    • Kirill A. Shutemov's avatar
      mm: avoid tail page refcounting on non-THP compound pages · c761471b
      Kirill A. Shutemov authored
      Reintroduce 8d63d99a
      
       ("mm: avoid tail page refcounting on non-THP
      compound pages") after removing bogus VM_BUG_ON_PAGE() in
      put_unrefcounted_compound_page().
      
      THP uses tail page refcounting to be able to split huge pages at any time.
       Tail page refcounting is not needed for other users of compound pages and
      it's harmful because of overhead.
      
      We try to exclude non-THP pages from tail page refcounting using
      __compound_tail_refcounted() check.  It excludes most common non-THP
      compound pages: SL*B and hugetlb, but it doesn't catch rest of __GFP_COMP
      users -- drivers.
      
      And it's not only about overhead.
      
      Drivers might want to use compound pages to get refcounting semantics
      suitable for mapping high-order pages to userspace.  But tail page
      refcounting breaks it.
      
      Tail page refcounting uses ->_mapcount in tail pages to store GUP pins on
      them.  It means GUP pins would affect page_mapcount() for tail pages.
      It's not a problem for THP, because it never maps tail pages.  But unlike
      THP, drivers map parts of compound pages with PTEs and it makes
      page_mapcount() be called for tail pages.
      
      In particular, GUP pins would shift PSS up and affect /proc/kpagecount for
      such pages.  But, I'm not aware about anything which can lead to crash or
      other serious misbehaviour.
      
      Since currently all THP pages are anonymous and all drivers pages are not,
      we can fix the __compound_tail_refcounted() check by requiring PageAnon()
      to enable tail page refcounting.
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Reviewed-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: default avatarBorislav Petkov <bp@alien8.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c761471b
  14. 02 Jun, 2015 3 commits
    • Tejun Heo's avatar
      writeback: implement unlocked_inode_to_wb transaction and use it for stat updates · 682aa8e1
      Tejun Heo authored
      
      
      The mechanism for detecting whether an inode should switch its wb
      (bdi_writeback) association is now in place.  This patch build the
      framework for the actual switching.
      
      This patch adds a new inode flag I_WB_SWITCHING, which has two
      functions.  First, the easy one, it ensures that there's only one
      switching in progress for a give inode.  Second, it's used as a
      mechanism to synchronize wb stat updates.
      
      The two stats, WB_RECLAIMABLE and WB_WRITEBACK, aren't event counters
      but track the current number of dirty pages and pages under writeback
      respectively.  As such, when an inode is moved from one wb to another,
      the inode's portion of those stats have to be transferred together;
      unfortunately, this is a bit tricky as those stat updates are percpu
      operations which are performed without holding any lock in some
      places.
      
      This patch solves the problem in a similar way as memcg.  Each such
      lockless stat updates are wrapped in transaction surrounded by
      unlocked_inode_to_wb_begin/end().  During normal operation, they map
      to rcu_read_lock/unlock(); however, if I_WB_SWITCHING is asserted,
      mapping->tree_lock is grabbed across the transaction.
      
      In turn, the switching path sets I_WB_SWITCHING and waits for a RCU
      grace period to pass before actually starting to switch, which
      guarantees that all stat update paths are synchronizing against
      mapping->tree_lock.
      
      This patch still doesn't implement the actual switching.
      
      v3: Updated on top of the recent cancel_dirty_page() updates.
          unlocked_inode_to_wb_begin() now nests inside
          mem_cgroup_begin_page_stat() to match the locking order.
      
      v2: The i_wb access transaction will be used for !stat accesses too.
          Function names and comments updated accordingly.
      
          s/inode_wb_stat_unlocked_{begin|end}/unlocked_inode_to_wb_{begin|end}/
          s/switch_wb/switch_wbs/
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      682aa8e1
    • Greg Thelen's avatar
      memcg: add per cgroup dirty page accounting · c4843a75
      Greg Thelen authored
      When modifying PG_Dirty on cached file pages, update the new
      MEM_CGROUP_STAT_DIRTY counter.  This is done in the same places where
      global NR_FILE_DIRTY is managed.  The new memcg stat is visible in the
      per memcg memory.stat cgroupfs file.  The most recent past attempt at
      this was http://thread.gmane.org/gmane.linux.kernel.cgroups/8632
      
      The new accounting supports future efforts to add per cgroup dirty
      page throttling and writeback.  It also helps an administrator break
      down a container's memory usage and provides evidence to understand
      memcg oom kills (the new dirty count is included in memcg oom kill
      messages).
      
      The ability to move page accounting between memcg
      (memory.move_charge_at_immigrate) makes this accounting more
      complicated than the global counter.  The existing
      mem_cgroup_{begin,end}_page_stat() lock is used to serialize move
      accounting with stat updates.
      Typical update operation:
      	memcg = mem_cgroup_begin_page_stat(page)
      	if (TestSetPageDirty()) {
      		[...]
      		mem_cgroup_update_page_stat(memcg)
      	}
      	mem_cgroup_end_page_stat(memcg)
      
      Summary of mem_cgroup_end_page_stat() overhead:
      - Without CONFIG_MEMCG it's a no-op
      - With CONFIG_MEMCG and no inter memcg task movement, it's just
        rcu_read_lock()
      - With CONFIG_MEMCG and inter memcg  task movement, it's
        rcu_read_lock() + spin_lock_irqsave()
      
      A memcg parameter is added to several routines because their callers
      now grab mem_cgroup_begin_page_stat() which returns the memcg later
      needed by for mem_cgroup_update_page_stat().
      
      Because mem_cgroup_begin_page_stat() may disable interrupts, some
      adjustments are needed:
      - move __mark_inode_dirty() from __set_page_dirty() to its caller.
        __mark_inode_dirty() locking does not want interrupts disabled.
      - use spin_lock_irqsave(tree_lock) rather than spin_lock_irq() in
        __delete_from_page_cache(), replace_page_cache_page(),
        invalidate_complete_page2(), and __remove_mapping().
      
         text    data     bss      dec    hex filename
      8925147 1774832 1785856 12485835 be84cb vmlinux-!CONFIG_MEMCG-before
      8925339 1774832 1785856 12486027 be858b vmlinux-!CONFIG_MEMCG-after
                                  +192 text bytes
      8965977 1784992 1785856 12536825 bf4bf9 vmlinux-CONFIG_MEMCG-before
      8966750 1784992 1785856 12537598 bf4efe vmlinux-CONFIG_MEMCG-after
                                  +773 text bytes
      
      Performance tests run on v4.0-rc1-36-g4f671fe2
      
      .  Lower is better for
      all metrics, they're all wall clock or cycle counts.  The read and write
      fault benchmarks just measure fault time, they do not include I/O time.
      
      * CONFIG_MEMCG not set:
                                  baseline                              patched
        kbuild                 1m25.030000(+-0.088% 3 samples)       1m25.426667(+-0.120% 3 samples)
        dd write 100 MiB          0.859211561 +-15.10%                  0.874162885 +-15.03%
        dd write 200 MiB          1.670653105 +-17.87%                  1.669384764 +-11.99%
        dd write 1000 MiB         8.434691190 +-14.15%                  8.474733215 +-14.77%
        read fault cycles       254.0(+-0.000% 10 samples)            253.0(+-0.000% 10 samples)
        write fault cycles     2021.2(+-3.070% 10 samples)           1984.5(+-1.036% 10 samples)
      
      * CONFIG_MEMCG=y root_memcg:
                                  baseline                              patched
        kbuild                 1m25.716667(+-0.105% 3 samples)       1m25.686667(+-0.153% 3 samples)
        dd write 100 MiB          0.855650830 +-14.90%                  0.887557919 +-14.90%
        dd write 200 MiB          1.688322953 +-12.72%                  1.667682724 +-13.33%
        dd write 1000 MiB         8.418601605 +-14.30%                  8.673532299 +-15.00%
        read fault cycles       266.0(+-0.000% 10 samples)            266.0(+-0.000% 10 samples)
        write fault cycles     2051.7(+-1.349% 10 samples)           2049.6(+-1.686% 10 samples)
      
      * CONFIG_MEMCG=y non-root_memcg:
                                  baseline                              patched
        kbuild                 1m26.120000(+-0.273% 3 samples)       1m25.763333(+-0.127% 3 samples)
        dd write 100 MiB          0.861723964 +-15.25%                  0.818129350 +-14.82%
        dd write 200 MiB          1.669887569 +-13.30%                  1.698645885 +-13.27%
        dd write 1000 MiB         8.383191730 +-14.65%                  8.351742280 +-14.52%
        read fault cycles       265.7(+-0.172% 10 samples)            267.0(+-0.000% 10 samples)
        write fault cycles     2070.6(+-1.512% 10 samples)           2084.4(+-2.148% 10 samples)
      
      As expected anon page faults are not affected by this patch.
      
      tj: Updated to apply on top of the recent cancel_dirty_page() changes.
      Signed-off-by: default avatarSha Zhengju <handai.szj@gmail.com>
      Signed-off-by: default avatarGreg Thelen <gthelen@google.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      c4843a75
    • Tejun Heo's avatar
      page_writeback: revive cancel_dirty_page() in a restricted form · 11f81bec
      Tejun Heo authored
      cancel_dirty_page() had some issues and b9ea2515
      
       ("page_writeback:
      clean up mess around cancel_dirty_page()") replaced it with
      account_page_cleaned() which makes the caller responsible for clearing
      the dirty bit; unfortunately, the planned changes for cgroup writeback
      support requires synchronization between dirty bit manipulation and
      stat updates.  While we can open-code such synchronization in each
      account_page_cleaned() callsite, that's gonna be unnecessarily awkward
      and verbose.
      
      This patch revives cancel_dirty_page() but in a more restricted form.
      All it does is TestClearPageDirty() followed by account_page_cleaned()
      invocation if the page was dirty.  This helper covers all
      account_page_cleaned() usages except for __delete_from_page_cache()
      which is a special case anyway and left alone.  As this leaves no
      module user for account_page_cleaned(), EXPORT_SYMBOL() is dropped
      from it.
      
      This patch just revives cancel_dirty_page() as a trivial wrapper to
      replace equivalent usages and doesn't introduce any functional
      changes.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      11f81bec