1. 08 Sep, 2015 5 commits
    • Mike Kravetz's avatar
      hugetlbfs: add hugetlbfs_fallocate() · 70c3547e
      Mike Kravetz authored
      This is based on the shmem version, but it has diverged quite a bit.  We
      have no swap to worry about, nor the new file sealing.  Add
      synchronication via the fault mutex table to coordinate page faults,
      fallocate allocation and fallocate hole punch.
      
      What this allows us to do is move physical memory in and out of a
      hugetlbfs file without having it mapped.  This also gives us the ability
      to support MADV_REMOVE since it is currently implemented using
      fallocate().  MADV_REMOVE lets madvise() remove pages from the middle of
      a hugetlbfs file, which wasn't possible before.
      
      hugetlbfs fallocate only operates on whole huge pages.
      
      Based on code by Dave Hansen.
      Signed-off-by: 's avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: 's avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: 's avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      70c3547e
    • Mike Kravetz's avatar
      hugetlbfs: New huge_add_to_page_cache helper routine · ab76ad54
      Mike Kravetz authored
      Currently, there is only a single place where hugetlbfs pages are added
      to the page cache.  The new fallocate code be adding a second one, so
      break the functionality out into its own helper.
      Signed-off-by: 's avatarDave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: 's avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: 's avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: 's avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      ab76ad54
    • Mike Kravetz's avatar
      hugetlbfs: truncate_hugepages() takes a range of pages · b5cec28d
      Mike Kravetz authored
      Modify truncate_hugepages() to take a range of pages (start, end)
      instead of simply start.  If an end value of LLONG_MAX is passed, the
      current "truncate" functionality is maintained.  Existing callers are
      modified to pass LLONG_MAX as end of range.  By keying off end ==
      LLONG_MAX, the routine behaves differently for truncate and hole punch.
      Page removal is now synchronized with page allocation via faults by
      using the fault mutex table.  The hole punch case can experience the
      rare region_del error and must handle accordingly.
      
      Add the routine hugetlb_fix_reserve_counts to fix up reserve counts in
      the case where region_del returns an error.
      
      Since the routine handles more than just the truncate case, it is
      renamed to remove_inode_hugepages().  To be consistent, the routine
      truncate_huge_page() is renamed remove_huge_page().
      
      Downstream of remove_inode_hugepages(), the routine
      hugetlb_unreserve_pages() is also modified to take a range of pages.
      hugetlb_unreserve_pages is modified to detect an error from region_del and
      pass it back to the caller.
      Signed-off-by: 's avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: 's avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: 's avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      b5cec28d
    • Mike Kravetz's avatar
      mm/hugetlb: expose hugetlb fault mutex for use by fallocate · c672c7f2
      Mike Kravetz authored
      hugetlb page faults are currently synchronized by the table of mutexes
      (htlb_fault_mutex_table).  fallocate code will need to synchronize with
      the page fault code when it allocates or deletes pages.  Expose
      interfaces so that fallocate operations can be synchronized with page
      faults.  Minor name changes to be more consistent with other global
      hugetlb symbols.
      Signed-off-by: 's avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: 's avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: 's avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      c672c7f2
    • Mike Kravetz's avatar
      mm/hugetlb: add cache of descriptors to resv_map for region_add · 5e911373
      Mike Kravetz authored
      hugetlbfs is used today by applications that want a high degree of
      control over huge page usage.  Often, large hugetlbfs files are used to
      map a large number huge pages into the application processes.  The
      applications know when page ranges within these large files will no
      longer be used, and ideally would like to release them back to the
      subpool or global pools for other uses.  The fallocate() system call
      provides an interface for preallocation and hole punching within files.
      This patch set adds fallocate functionality to hugetlbfs.
      
      fallocate hole punch will want to remove a specific range of pages.
      When pages are removed, their associated entries in the region/reserve
      map will also be removed.  This will break an assumption in the
      region_chg/region_add calling sequence.  If a new region descriptor must
      be allocated, it is done as part of the region_chg processing.  In this
      way, region_add can not fail because it does not need to attempt an
      allocation.
      
      To prepare for fallocate hole punch, create a "cache" of descriptors
      that can be used by region_add if necessary.  region_chg will ensure
      there are sufficient entries in the cache.  It will be necessary to
      track the number of in progress add operations to know a sufficient
      number of descriptors reside in the cache.  A new routine region_abort
      is added to adjust this in progress count when add operations are
      aborted.  vma_abort_reservation is also added for callers creating
      reservations with vma_needs_reservation/vma_commit_reservation.
      
      [akpm@linux-foundation.org: fix typo in comment, use more cols]
      Signed-off-by: 's avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: 's avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: 's avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      5e911373
  2. 17 Jul, 2015 1 commit
  3. 15 Apr, 2015 4 commits
    • Naoya Horiguchi's avatar
      mm: hugetlb: cleanup using paeg_huge_active() · 7e1f049e
      Naoya Horiguchi authored
      Now we have an easy access to hugepages' activeness, so existing helpers to
      get the information can be cleaned up.
      
      [akpm@linux-foundation.org: s/PageHugeActive/page_huge_active/]
      Signed-off-by: 's avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hugh Dickins <hughd@google.com>
      Reviewed-by: 's avatarMichal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      7e1f049e
    • Mike Kravetz's avatar
      hugetlbfs: accept subpool min_size mount option and setup accordingly · 7ca02d0a
      Mike Kravetz authored
      Make 'min_size=<value>' be an option when mounting a hugetlbfs.  This
      option takes the same value as the 'size' option.  min_size can be
      specified without specifying size.  If both are specified, min_size must
      be less that or equal to size else the mount will fail.  If min_size is
      specified, then at mount time an attempt is made to reserve min_size
      pages.  If the reservation fails, the mount fails.  At umount time, the
      reserved pages are released.
      Signed-off-by: 's avatarMike Kravetz <mike.kravetz@oracle.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      7ca02d0a
    • Mike Kravetz's avatar
      hugetlbfs: add minimum size tracking fields to subpool structure · c6a91820
      Mike Kravetz authored
      hugetlbfs allocates huge pages from the global pool as needed.  Even if
      the global pool contains a sufficient number pages for the filesystem size
      at mount time, those global pages could be grabbed for some other use.  As
      a result, filesystem huge page allocations may fail due to lack of pages.
      
      Applications such as a database want to use huge pages for performance
      reasons.  hugetlbfs filesystem semantics with ownership and modes work
      well to manage access to a pool of huge pages.  However, the application
      would like some reasonable assurance that allocations will not fail due to
      a lack of huge pages.  At application startup time, the application would
      like to configure itself to use a specific number of huge pages.  Before
      starting, the application can check to make sure that enough huge pages
      exist in the system global pools.  However, there are no guarantees that
      those pages will be available when needed by the application.  What the
      application wants is exclusive use of a subset of huge pages.
      
      Add a new hugetlbfs mount option 'min_size=<value>' to indicate that the
      specified number of pages will be available for use by the filesystem.  At
      mount time, this number of huge pages will be reserved for exclusive use
      of the filesystem.  If there is not a sufficient number of free pages, the
      mount will fail.  As pages are allocated to and freeed from the
      filesystem, the number of reserved pages is adjusted so that the specified
      minimum is maintained.
      
      This patch (of 4):
      
      Add a field to the subpool structure to indicate the minimimum number of
      huge pages to always be used by this subpool.  This minimum count includes
      allocated pages as well as reserved pages.  If the minimum number of pages
      for the subpool have not been allocated, pages are reserved up to this
      minimum.  An additional field (rsv_hpages) is used to track the number of
      pages reserved to meet this minimum size.  The hstate pointer in the
      subpool is convenient to have when reserving and unreserving the pages.
      Signed-off-by: 's avatarMike Kravetz <mike.kravetz@oracle.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      c6a91820
    • Kirill A. Shutemov's avatar
      mm: consolidate all page-flags helpers in <linux/page-flags.h> · e8c6158f
      Kirill A. Shutemov authored
      Currently we take a naive approach to page flags on compound pages - we
      set the flag on the page without consideration if the flag makes sense
      for tail page or for compound page in general.  This patchset try to
      sort this out by defining per-flag policy on what need to be done if
      page-flag helper operate on compound page.
      
      The last patch in the patchset also sanitizes usege of page->mapping for
      tail pages.  We don't define the meaning of page->mapping for tail
      pages.  Currently it's always NULL, which can be inconsistent with head
      page and potentially lead to problems.
      
      For now I caught one case of illegal usage of page flags or ->mapping:
      sound subsystem allocates pages with __GFP_COMP and maps them with PTEs.
      It leads to setting dirty bit on tail pages and access to tail_page's
      ->mapping.  I don't see any bad behaviour caused by this, but worth
      fixing anyway.
      
      This patchset makes more sense if you take my THP refcounting into
      account: we will see more compound pages mapped with PTEs and we need to
      define behaviour of flags on compound pages to avoid bugs.
      
      This patch (of 16):
      
      We have page-flags helper function declarations/definitions spread over
      several header files.  Let's consolidate them in <linux/page-flags.h>.
      Signed-off-by: 's avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: 's avatarHugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      e8c6158f
  4. 12 Feb, 2015 1 commit
    • Naoya Horiguchi's avatar
      mm/hugetlb: take page table lock in follow_huge_pmd() · e66f17ff
      Naoya Horiguchi authored
      We have a race condition between move_pages() and freeing hugepages, where
      move_pages() calls follow_page(FOLL_GET) for hugepages internally and
      tries to get its refcount without preventing concurrent freeing.  This
      race crashes the kernel, so this patch fixes it by moving FOLL_GET code
      for hugepages into follow_huge_pmd() with taking the page table lock.
      
      This patch intentionally removes page==NULL check after pte_page.
      This is justified because pte_page() never returns NULL for any
      architectures or configurations.
      
      This patch changes the behavior of follow_huge_pmd() for tail pages and
      then tail pages can be pinned/returned.  So the caller must be changed to
      properly handle the returned tail pages.
      
      We could have a choice to add the similar locking to
      follow_huge_(addr|pud) for consistency, but it's not necessary because
      currently these functions don't support FOLL_GET flag, so let's leave it
      for future development.
      
      Here is the reproducer:
      
        $ cat movepages.c
        #include <stdio.h>
        #include <stdlib.h>
        #include <numaif.h>
      
        #define ADDR_INPUT      0x700000000000UL
        #define HPS             0x200000
        #define PS              0x1000
      
        int main(int argc, char *argv[]) {
                int i;
                int nr_hp = strtol(argv[1], NULL, 0);
                int nr_p  = nr_hp * HPS / PS;
                int ret;
                void **addrs;
                int *status;
                int *nodes;
                pid_t pid;
      
                pid = strtol(argv[2], NULL, 0);
                addrs  = malloc(sizeof(char *) * nr_p + 1);
                status = malloc(sizeof(char *) * nr_p + 1);
                nodes  = malloc(sizeof(char *) * nr_p + 1);
      
                while (1) {
                        for (i = 0; i < nr_p; i++) {
                                addrs[i] = (void *)ADDR_INPUT + i * PS;
                                nodes[i] = 1;
                                status[i] = 0;
                        }
                        ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
                                              MPOL_MF_MOVE_ALL);
                        if (ret == -1)
                                err("move_pages");
      
                        for (i = 0; i < nr_p; i++) {
                                addrs[i] = (void *)ADDR_INPUT + i * PS;
                                nodes[i] = 0;
                                status[i] = 0;
                        }
                        ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
                                              MPOL_MF_MOVE_ALL);
                        if (ret == -1)
                                err("move_pages");
                }
                return 0;
        }
      
        $ cat hugepage.c
        #include <stdio.h>
        #include <sys/mman.h>
        #include <string.h>
      
        #define ADDR_INPUT      0x700000000000UL
        #define HPS             0x200000
      
        int main(int argc, char *argv[]) {
                int nr_hp = strtol(argv[1], NULL, 0);
                char *p;
      
                while (1) {
                        p = mmap((void *)ADDR_INPUT, nr_hp * HPS, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
                        if (p != (void *)ADDR_INPUT) {
                                perror("mmap");
                                break;
                        }
                        memset(p, 0, nr_hp * HPS);
                        munmap(p, nr_hp * HPS);
                }
        }
      
        $ sysctl vm.nr_hugepages=40
        $ ./hugepage 10 &
        $ ./movepages 10 $(pgrep -f hugepage)
      
      Fixes: e632a938 ("mm: migrate: add hugepage migration code to move_pages()")
      Signed-off-by: 's avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reported-by: 's avatarHugh Dickins <hughd@google.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: <stable@vger.kernel.org>	[3.12+]
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      e66f17ff
  5. 10 Feb, 2015 1 commit
  6. 11 Dec, 2014 1 commit
  7. 14 Nov, 2014 1 commit
  8. 07 Aug, 2014 1 commit
  9. 31 Jul, 2014 1 commit
    • Atsushi Kumagai's avatar
      kexec: export free_huge_page to VMCOREINFO · 8f1d26d0
      Atsushi Kumagai authored
      PG_head_mask was added into VMCOREINFO to filter huge pages in b3acc56b
      ("kexec: save PG_head_mask in VMCOREINFO"), but makedumpfile still need
      another symbol to filter *hugetlbfs* pages.
      
      If a user hope to filter user pages, makedumpfile tries to exclude them by
      checking the condition whether the page is anonymous, but hugetlbfs pages
      aren't anonymous while they also be user pages.
      
      We know it's possible to detect them in the same way as PageHuge(),
      so we need the start address of free_huge_page():
      
          int PageHuge(struct page *page)
          {
                  if (!PageCompound(page))
                          return 0;
      
                  page = compound_head(page);
                  return get_compound_page_dtor(page) == free_huge_page;
          }
      
      For that reason, this patch changes free_huge_page() into public
      to export it to VMCOREINFO.
      Signed-off-by: 's avatarAtsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
      Acked-by: 's avatarBaoquan He <bhe@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      8f1d26d0
  10. 04 Jun, 2014 3 commits
  11. 06 May, 2014 1 commit
    • Nishanth Aravamudan's avatar
      hugetlb: ensure hugepage access is denied if hugepages are not supported · 457c1b27
      Nishanth Aravamudan authored
      Currently, I am seeing the following when I `mount -t hugetlbfs /none
      /dev/hugetlbfs`, and then simply do a `ls /dev/hugetlbfs`.  I think it's
      related to the fact that hugetlbfs is properly not correctly setting
      itself up in this state?:
      
        Unable to handle kernel paging request for data at address 0x00000031
        Faulting instruction address: 0xc000000000245710
        Oops: Kernel access of bad area, sig: 11 [#1]
        SMP NR_CPUS=2048 NUMA pSeries
        ....
      
      In KVM guests on Power, in a guest not backed by hugepages, we see the
      following:
      
        AnonHugePages:         0 kB
        HugePages_Total:       0
        HugePages_Free:        0
        HugePages_Rsvd:        0
        HugePages_Surp:        0
        Hugepagesize:         64 kB
      
      HPAGE_SHIFT == 0 in this configuration, which indicates that hugepages
      are not supported at boot-time, but this is only checked in
      hugetlb_init().  Extract the check to a helper function, and use it in a
      few relevant places.
      
      This does make hugetlbfs not supported (not registered at all) in this
      environment.  I believe this is fine, as there are no valid hugepages
      and that won't change at runtime.
      
      [akpm@linux-foundation.org: use pr_info(), per Mel]
      [akpm@linux-foundation.org: fix build when HPAGE_SHIFT is undefined]
      Signed-off-by: 's avatarNishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Reviewed-by: 's avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: 's avatarMel Gorman <mgorman@suse.de>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      457c1b27
  12. 03 Apr, 2014 2 commits
  13. 24 Jan, 2014 1 commit
  14. 22 Jan, 2014 2 commits
    • Andrea Arcangeli's avatar
      mm: tail page refcounting optimization for slab and hugetlbfs · 44518d2b
      Andrea Arcangeli authored
      This skips the _mapcount mangling for slab and hugetlbfs pages.
      
      The main trouble in doing this is to guarantee that PageSlab and
      PageHeadHuge remains constant for all get_page/put_page run on the tail
      of slab or hugetlbfs compound pages.  Otherwise if they're set during
      get_page but not set during put_page, the _mapcount of the tail page
      would underflow.
      
      PageHeadHuge will remain true until the compound page is released and
      enters the buddy allocator so it won't risk to change even if the tail
      page is the last reference left on the page.
      
      PG_slab instead is cleared before the slab frees the head page with
      put_page, so if the tail pin is released after the slab freed the page,
      we would have a problem.  But in the slab case the tail pin cannot be
      the last reference left on the page.  This is because the slab code is
      free to reuse the compound page after a kfree/kmem_cache_free without
      having to check if there's any tail pin left.  In turn all tail pins
      must be always released while the head is still pinned by the slab code
      and so we know PG_slab will be still set too.
      Signed-off-by: 's avatarAndrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: 's avatarKhalid Aziz <khalid.aziz@oracle.com>
      Cc: Pravin Shelar <pshelar@nicira.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      44518d2b
    • Dave Hansen's avatar
      mm: hugetlbfs: Add some VM_BUG_ON()s to catch non-hugetlbfs pages · 0e147aed
      Dave Hansen authored
      Dave Jiang reported that he was seeing oopses when running NUMA systems
      and default_hugepagesz=1G.  I traced the issue down to
      migrate_page_copy() trying to use the same code for hugetlb pages and
      transparent hugepages.  It should not have been trying to pass thp pages
      in there.
      
      So, add some VM_BUG_ON()s for the next hapless VM developer that tries
      the same thing.
      Signed-off-by: 's avatarDave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: 's avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Tested-by: 's avatarDave Jiang <dave.jiang@intel.com>
      Acked-by: 's avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      0e147aed
  15. 13 Dec, 2013 1 commit
  16. 22 Nov, 2013 2 commits
    • Andrea Arcangeli's avatar
      mm: hugetlbfs: fix hugetlbfs optimization · 27c73ae7
      Andrea Arcangeli authored
      Commit 7cb2ef56 ("mm: fix aio performance regression for database
      caused by THP") can cause dereference of a dangling pointer if
      split_huge_page runs during PageHuge() if there are updates to the
      tail_page->private field.
      
      Also it is repeating compound_head twice for hugetlbfs and it is running
      compound_head+compound_trans_head for THP when a single one is needed in
      both cases.
      
      The new code within the PageSlab() check doesn't need to verify that the
      THP page size is never bigger than the smallest hugetlbfs page size, to
      avoid memory corruption.
      
      A longstanding theoretical race condition was found while fixing the
      above (see the change right after the skip_unlock label, that is
      relevant for the compound_lock path too).
      
      By re-establishing the _mapcount tail refcounting for all compound
      pages, this also fixes the below problem:
      
        echo 0 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
      
        BUG: Bad page state in process bash  pfn:59a01
        page:ffffea000139b038 count:0 mapcount:10 mapping:          (null) index:0x0
        page flags: 0x1c00000000008000(tail)
        Modules linked in:
        CPU: 6 PID: 2018 Comm: bash Not tainted 3.12.0+ #25
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        Call Trace:
          dump_stack+0x55/0x76
          bad_page+0xd5/0x130
          free_pages_prepare+0x213/0x280
          __free_pages+0x36/0x80
          update_and_free_page+0xc1/0xd0
          free_pool_huge_page+0xc2/0xe0
          set_max_huge_pages.part.58+0x14c/0x220
          nr_hugepages_store_common.isra.60+0xd0/0xf0
          nr_hugepages_store+0x13/0x20
          kobj_attr_store+0xf/0x20
          sysfs_write_file+0x189/0x1e0
          vfs_write+0xc5/0x1f0
          SyS_write+0x55/0xb0
          system_call_fastpath+0x16/0x1b
      Signed-off-by: 's avatarKhalid Aziz <khalid.aziz@oracle.com>
      Signed-off-by: 's avatarAndrea Arcangeli <aarcange@redhat.com>
      Tested-by: 's avatarKhalid Aziz <khalid.aziz@oracle.com>
      Cc: Pravin Shelar <pshelar@nicira.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      27c73ae7
    • Dave Hansen's avatar
      mm: thp: give transparent hugepage code a separate copy_page · 30b0a105
      Dave Hansen authored
      Right now, the migration code in migrate_page_copy() uses copy_huge_page()
      for hugetlbfs and thp pages:
      
             if (PageHuge(page) || PageTransHuge(page))
                      copy_huge_page(newpage, page);
      
      So, yay for code reuse.  But:
      
        void copy_huge_page(struct page *dst, struct page *src)
        {
              struct hstate *h = page_hstate(src);
      
      and a non-hugetlbfs page has no page_hstate().  This works 99% of the
      time because page_hstate() determines the hstate from the page order
      alone.  Since the page order of a THP page matches the default hugetlbfs
      page order, it works.
      
      But, if you change the default huge page size on the boot command-line
      (say default_hugepagesz=1G), then we might not even *have* a 2MB hstate
      so page_hstate() returns null and copy_huge_page() oopses pretty fast
      since copy_huge_page() dereferences the hstate:
      
        void copy_huge_page(struct page *dst, struct page *src)
        {
              struct hstate *h = page_hstate(src);
              if (unlikely(pages_per_huge_page(h) > MAX_ORDER_NR_PAGES)) {
        ...
      
      Mel noticed that the migration code is really the only user of these
      functions.  This moves all the copy code over to migrate.c and makes
      copy_huge_page() work for THP by checking for it explicitly.
      
      I believe the bug was introduced in commit b32967ff ("mm: numa: Add
      THP migration for the NUMA working set scanning fault case")
      
      [akpm@linux-foundation.org: fix coding-style and comment text, per Naoya Horiguchi]
      Signed-off-by: 's avatarDave Hansen <dave.hansen@linux.intel.com>
      Acked-by: 's avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: 's avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Tested-by: 's avatarDave Jiang <dave.jiang@intel.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      30b0a105
  17. 15 Nov, 2013 1 commit
    • Kirill A. Shutemov's avatar
      mm, hugetlb: convert hugetlbfs to use split pmd lock · cb900f41
      Kirill A. Shutemov authored
      Hugetlb supports multiple page sizes. We use split lock only for PMD
      level, but not for PUD.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: 's avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: 's avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: 's avatarAlex Thorlton <athorlton@sgi.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Robin Holt <robinmholt@gmail.com>
      Cc: Sedat Dilek <sedat.dilek@gmail.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      cb900f41
  18. 11 Sep, 2013 4 commits
    • Naoya Horiguchi's avatar
      mm: migrate: check movability of hugepage in unmap_and_move_huge_page() · 83467efb
      Naoya Horiguchi authored
      Currently hugepage migration works well only for pmd-based hugepages
      (mainly due to lack of testing,) so we had better not enable migration of
      other levels of hugepages until we are ready for it.
      
      Some users of hugepage migration (mbind, move_pages, and migrate_pages) do
      page table walk and check pud/pmd_huge() there, so they are safe.  But the
      other users (softoffline and memory hotremove) don't do this, so without
      this patch they can try to migrate unexpected types of hugepages.
      
      To prevent this, we introduce hugepage_migration_support() as an
      architecture dependent check of whether hugepage are implemented on a pmd
      basis or not.  And on some architecture multiple sizes of hugepages are
      available, so hugepage_migration_support() also checks hugepage size.
      Signed-off-by: 's avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      83467efb
    • Naoya Horiguchi's avatar
      mm: memory-hotplug: enable memory hotplug to handle hugepage · c8721bbb
      Naoya Horiguchi authored
      Until now we can't offline memory blocks which contain hugepages because a
      hugepage is considered as an unmovable page.  But now with this patch
      series, a hugepage has become movable, so by using hugepage migration we
      can offline such memory blocks.
      
      What's different from other users of hugepage migration is that we need to
      decompose all the hugepages inside the target memory block into free buddy
      pages after hugepage migration, because otherwise free hugepages remaining
      in the memory block intervene the memory offlining.  For this reason we
      introduce new functions dissolve_free_huge_page() and
      dissolve_free_huge_pages().
      
      Other than that, what this patch does is straightforwardly to add hugepage
      migration code, that is, adding hugepage code to the functions which scan
      over pfn and collect hugepages to be migrated, and adding a hugepage
      allocation function to alloc_migrate_target().
      
      As for larger hugepages (1GB for x86_64), it's not easy to do hotremove
      over them because it's larger than memory block.  So we now simply leave
      it to fail as it is.
      
      [yongjun_wei@trendmicro.com.cn: remove duplicated include]
      Signed-off-by: 's avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: 's avatarAndi Kleen <ak@linux.intel.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: 's avatarWei Yongjun <yongjun_wei@trendmicro.com.cn>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      c8721bbb
    • Naoya Horiguchi's avatar
      mm: mbind: add hugepage migration code to mbind() · 74060e4d
      Naoya Horiguchi authored
      Extend do_mbind() to handle vma with VM_HUGETLB set.  We will be able to
      migrate hugepage with mbind(2) after applying the enablement patch which
      comes later in this series.
      Signed-off-by: 's avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: 's avatarAndi Kleen <ak@linux.intel.com>
      Reviewed-by: 's avatarWanpeng Li <liwanp@linux.vnet.ibm.com>
      Acked-by: 's avatarHillf Danton <dhillf@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      74060e4d
    • Naoya Horiguchi's avatar
      mm: migrate: make core migration code aware of hugepage · 31caf665
      Naoya Horiguchi authored
      Currently hugepage migration is available only for soft offlining, but
      it's also useful for some other users of page migration (clearly because
      users of hugepage can enjoy the benefit of mempolicy and memory hotplug.)
      So this patchset tries to extend such users to support hugepage migration.
      
      The target of this patchset is to enable hugepage migration for NUMA
      related system calls (migrate_pages(2), move_pages(2), and mbind(2)), and
      memory hotplug.
      
      This patchset does not add hugepage migration for memory compaction,
      because users of memory compaction mainly expect to construct thp by
      arranging raw pages, and there's little or no need to compact hugepages.
      CMA, another user of page migration, can have benefit from hugepage
      migration, but is not enabled to support it for now (just because of lack
      of testing and expertise in CMA.)
      
      Hugepage migration of non pmd-based hugepage (for example 1GB hugepage in
      x86_64, or hugepages in architectures like ia64) is not enabled for now
      (again, because of lack of testing.)
      
      As for how these are achived, I extended the API (migrate_pages()) to
      handle hugepage (with patch 1 and 2) and adjusted code of each caller to
      check and collect movable hugepages (with patch 3-7).  Remaining 2 patches
      are kind of miscellaneous ones to avoid unexpected behavior.  Patch 8 is
      about making sure that we only migrate pmd-based hugepages.  And patch 9
      is about choosing appropriate zone for hugepage allocation.
      
      My test is mainly functional one, simply kicking hugepage migration via
      each entry point and confirm that migration is done correctly.  Test code
      is available here:
      
        git://github.com/Naoya-Horiguchi/test_hugepage_migration_extension.git
      
      And I always run libhugetlbfs test when changing hugetlbfs's code.  With
      this patchset, no regression was found in the test.
      
      This patch (of 9):
      
      Before enabling each user of page migration to support hugepage,
      this patch enables the list of pages for migration to link not only
      LRU pages, but also hugepages. As a result, putback_movable_pages()
      and migrate_pages() can handle both of LRU pages and hugepages.
      Signed-off-by: 's avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: 's avatarAndi Kleen <ak@linux.intel.com>
      Reviewed-by: 's avatarWanpeng Li <liwanp@linux.vnet.ibm.com>
      Acked-by: 's avatarHillf Danton <dhillf@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      31caf665
  19. 03 Jul, 2013 1 commit
  20. 25 Jun, 2013 1 commit
    • Zhang Yi's avatar
      futex: Take hugepages into account when generating futex_key · 13d60f4b
      Zhang Yi authored
      The futex_keys of process shared futexes are generated from the page
      offset, the mapping host and the mapping index of the futex user space
      address. This should result in an unique identifier for each futex.
      
      Though this is not true when futexes are located in different subpages
      of an hugepage. The reason is, that the mapping index for all those
      futexes evaluates to the index of the base page of the hugetlbfs
      mapping. So a futex at offset 0 of the hugepage mapping and another
      one at offset PAGE_SIZE of the same hugepage mapping have identical
      futex_keys. This happens because the futex code blindly uses
      page->index.
      
      Steps to reproduce the bug:
      
      1. Map a file from hugetlbfs. Initialize pthread_mutex1 at offset 0
         and pthread_mutex2 at offset PAGE_SIZE of the hugetlbfs
         mapping.
      
         The mutexes must be initialized as PTHREAD_PROCESS_SHARED because
         PTHREAD_PROCESS_PRIVATE mutexes are not affected by this issue as
         their keys solely depend on the user space address.
      
      2. Lock mutex1 and mutex2
      
      3. Create thread1 and in the thread function lock mutex1, which
         results in thread1 blocking on the locked mutex1.
      
      4. Create thread2 and in the thread function lock mutex2, which
         results in thread2 blocking on the locked mutex2.
      
      5. Unlock mutex2. Despite the fact that mutex2 got unlocked, thread2
         still blocks on mutex2 because the futex_key points to mutex1.
      
      To solve this issue we need to take the normal page index of the page
      which contains the futex into account, if the futex is in an hugetlbfs
      mapping. In other words, we calculate the normal page mapping index of
      the subpage in the hugetlbfs mapping.
      
      Mappings which are not based on hugetlbfs are not affected and still
      use page->index.
      
      Thanks to Mel Gorman who provided a patch for adding proper evaluation
      functions to the hugetlbfs code to avoid exposing hugetlbfs specific
      details to the futex code.
      
      [ tglx: Massaged changelog ]
      Signed-off-by: 's avatarZhang Yi <zhang.yi20@zte.com.cn>
      Reviewed-by: 's avatarJiang Biao <jiang.biao2@zte.com.cn>
      Tested-by: 's avatarMa Chenggong <ma.chenggong@zte.com.cn>
      Reviewed-by: 's avatar'Mel Gorman' <mgorman@suse.de>
      Acked-by: 's avatar'Darren Hart' <dvhart@linux.intel.com>
      Cc: 'Peter Zijlstra' <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/000101ce71a6%24a83c5880%24f8b50980%24@comSigned-off-by: 's avatarThomas Gleixner <tglx@linutronix.de>
      13d60f4b
  21. 14 Jun, 2013 1 commit
  22. 08 May, 2013 1 commit
  23. 29 Apr, 2013 1 commit
    • David Rientjes's avatar
      mm, hugetlb: include hugepages in meminfo · 949f7ec5
      David Rientjes authored
      Particularly in oom conditions, it's troublesome that hugetlb memory is
      not displayed.  All other meminfo that is emitted will not add up to
      what is expected, and there is no artifact left in the kernel log to
      show that a potentially significant amount of memory is actually
      allocated as hugepages which are not available to be reclaimed.
      
      Booting with hugepages=8192 on the command line, this memory is now
      shown in oom conditions.  For example, with echo m >
      /proc/sysrq-trigger:
      
        Node 0 hugepages_total=2048 hugepages_free=2048 hugepages_surp=0 hugepages_size=2048kB
        Node 1 hugepages_total=2048 hugepages_free=2048 hugepages_surp=0 hugepages_size=2048kB
        Node 2 hugepages_total=2048 hugepages_free=2048 hugepages_surp=0 hugepages_size=2048kB
        Node 3 hugepages_total=2048 hugepages_free=2048 hugepages_surp=0 hugepages_size=2048kB
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: 's avatarDavid Rientjes <rientjes@google.com>
      Acked-by: 's avatarMichal Hocko <mhocko@suse.cz>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      949f7ec5
  24. 24 Feb, 2013 1 commit
  25. 23 Feb, 2013 1 commit