1. 07 Aug, 2014 1 commit
  2. 31 Jul, 2014 1 commit
    • Atsushi Kumagai's avatar
      kexec: export free_huge_page to VMCOREINFO · 8f1d26d0
      Atsushi Kumagai authored
      PG_head_mask was added into VMCOREINFO to filter huge pages in b3acc56b
      
      
      ("kexec: save PG_head_mask in VMCOREINFO"), but makedumpfile still need
      another symbol to filter *hugetlbfs* pages.
      
      If a user hope to filter user pages, makedumpfile tries to exclude them by
      checking the condition whether the page is anonymous, but hugetlbfs pages
      aren't anonymous while they also be user pages.
      
      We know it's possible to detect them in the same way as PageHuge(),
      so we need the start address of free_huge_page():
      
          int PageHuge(struct page *page)
          {
                  if (!PageCompound(page))
                          return 0;
      
                  page = compound_head(page);
                  return get_compound_page_dtor(page) == free_huge_page;
          }
      
      For that reason, this patch changes free_huge_page() into public
      to export it to VMCOREINFO.
      Signed-off-by: default avatarAtsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
      Acked-by: default avatarBaoquan He <bhe@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8f1d26d0
  3. 04 Jun, 2014 3 commits
  4. 06 May, 2014 1 commit
    • Nishanth Aravamudan's avatar
      hugetlb: ensure hugepage access is denied if hugepages are not supported · 457c1b27
      Nishanth Aravamudan authored
      
      
      Currently, I am seeing the following when I `mount -t hugetlbfs /none
      /dev/hugetlbfs`, and then simply do a `ls /dev/hugetlbfs`.  I think it's
      related to the fact that hugetlbfs is properly not correctly setting
      itself up in this state?:
      
        Unable to handle kernel paging request for data at address 0x00000031
        Faulting instruction address: 0xc000000000245710
        Oops: Kernel access of bad area, sig: 11 [#1]
        SMP NR_CPUS=2048 NUMA pSeries
        ....
      
      In KVM guests on Power, in a guest not backed by hugepages, we see the
      following:
      
        AnonHugePages:         0 kB
        HugePages_Total:       0
        HugePages_Free:        0
        HugePages_Rsvd:        0
        HugePages_Surp:        0
        Hugepagesize:         64 kB
      
      HPAGE_SHIFT == 0 in this configuration, which indicates that hugepages
      are not supported at boot-time, but this is only checked in
      hugetlb_init().  Extract the check to a helper function, and use it in a
      few relevant places.
      
      This does make hugetlbfs not supported (not registered at all) in this
      environment.  I believe this is fine, as there are no valid hugepages
      and that won't change at runtime.
      
      [akpm@linux-foundation.org: use pr_info(), per Mel]
      [akpm@linux-foundation.org: fix build when HPAGE_SHIFT is undefined]
      Signed-off-by: default avatarNishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Reviewed-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      457c1b27
  5. 03 Apr, 2014 2 commits
  6. 24 Jan, 2014 1 commit
  7. 22 Jan, 2014 2 commits
    • Andrea Arcangeli's avatar
      mm: tail page refcounting optimization for slab and hugetlbfs · 44518d2b
      Andrea Arcangeli authored
      
      
      This skips the _mapcount mangling for slab and hugetlbfs pages.
      
      The main trouble in doing this is to guarantee that PageSlab and
      PageHeadHuge remains constant for all get_page/put_page run on the tail
      of slab or hugetlbfs compound pages.  Otherwise if they're set during
      get_page but not set during put_page, the _mapcount of the tail page
      would underflow.
      
      PageHeadHuge will remain true until the compound page is released and
      enters the buddy allocator so it won't risk to change even if the tail
      page is the last reference left on the page.
      
      PG_slab instead is cleared before the slab frees the head page with
      put_page, so if the tail pin is released after the slab freed the page,
      we would have a problem.  But in the slab case the tail pin cannot be
      the last reference left on the page.  This is because the slab code is
      free to reuse the compound page after a kfree/kmem_cache_free without
      having to check if there's any tail pin left.  In turn all tail pins
      must be always released while the head is still pinned by the slab code
      and so we know PG_slab will be still set too.
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: default avatarKhalid Aziz <khalid.aziz@oracle.com>
      Cc: Pravin Shelar <pshelar@nicira.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      44518d2b
    • Dave Hansen's avatar
      mm: hugetlbfs: Add some VM_BUG_ON()s to catch non-hugetlbfs pages · 0e147aed
      Dave Hansen authored
      
      
      Dave Jiang reported that he was seeing oopses when running NUMA systems
      and default_hugepagesz=1G.  I traced the issue down to
      migrate_page_copy() trying to use the same code for hugetlb pages and
      transparent hugepages.  It should not have been trying to pass thp pages
      in there.
      
      So, add some VM_BUG_ON()s for the next hapless VM developer that tries
      the same thing.
      Signed-off-by: default avatarDave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Tested-by: default avatarDave Jiang <dave.jiang@intel.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0e147aed
  8. 13 Dec, 2013 1 commit
  9. 22 Nov, 2013 2 commits
    • Andrea Arcangeli's avatar
      mm: hugetlbfs: fix hugetlbfs optimization · 27c73ae7
      Andrea Arcangeli authored
      Commit 7cb2ef56
      
       ("mm: fix aio performance regression for database
      caused by THP") can cause dereference of a dangling pointer if
      split_huge_page runs during PageHuge() if there are updates to the
      tail_page->private field.
      
      Also it is repeating compound_head twice for hugetlbfs and it is running
      compound_head+compound_trans_head for THP when a single one is needed in
      both cases.
      
      The new code within the PageSlab() check doesn't need to verify that the
      THP page size is never bigger than the smallest hugetlbfs page size, to
      avoid memory corruption.
      
      A longstanding theoretical race condition was found while fixing the
      above (see the change right after the skip_unlock label, that is
      relevant for the compound_lock path too).
      
      By re-establishing the _mapcount tail refcounting for all compound
      pages, this also fixes the below problem:
      
        echo 0 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
      
        BUG: Bad page state in process bash  pfn:59a01
        page:ffffea000139b038 count:0 mapcount:10 mapping:          (null) index:0x0
        page flags: 0x1c00000000008000(tail)
        Modules linked in:
        CPU: 6 PID: 2018 Comm: bash Not tainted 3.12.0+ #25
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        Call Trace:
          dump_stack+0x55/0x76
          bad_page+0xd5/0x130
          free_pages_prepare+0x213/0x280
          __free_pages+0x36/0x80
          update_and_free_page+0xc1/0xd0
          free_pool_huge_page+0xc2/0xe0
          set_max_huge_pages.part.58+0x14c/0x220
          nr_hugepages_store_common.isra.60+0xd0/0xf0
          nr_hugepages_store+0x13/0x20
          kobj_attr_store+0xf/0x20
          sysfs_write_file+0x189/0x1e0
          vfs_write+0xc5/0x1f0
          SyS_write+0x55/0xb0
          system_call_fastpath+0x16/0x1b
      Signed-off-by: default avatarKhalid Aziz <khalid.aziz@oracle.com>
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Tested-by: default avatarKhalid Aziz <khalid.aziz@oracle.com>
      Cc: Pravin Shelar <pshelar@nicira.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      27c73ae7
    • Dave Hansen's avatar
      mm: thp: give transparent hugepage code a separate copy_page · 30b0a105
      Dave Hansen authored
      Right now, the migration code in migrate_page_copy() uses copy_huge_page()
      for hugetlbfs and thp pages:
      
             if (PageHuge(page) || PageTransHuge(page))
                      copy_huge_page(newpage, page);
      
      So, yay for code reuse.  But:
      
        void copy_huge_page(struct page *dst, struct page *src)
        {
              struct hstate *h = page_hstate(src);
      
      and a non-hugetlbfs page has no page_hstate().  This works 99% of the
      time because page_hstate() determines the hstate from the page order
      alone.  Since the page order of a THP page matches the default hugetlbfs
      page order, it works.
      
      But, if you change the default huge page size on the boot command-line
      (say default_hugepagesz=1G), then we might not even *have* a 2MB hstate
      so page_hstate() returns null and copy_huge_page() oopses pretty fast
      since copy_huge_page() dereferences the hstate:
      
        void copy_huge_page(struct page *dst, struct page *src)
        {
              struct hstate *h = page_hstate(src);
              if (unlikely(pages_per_huge_page(h) > MAX_ORDER_NR_PAGES)) {
        ...
      
      Mel noticed that the migration code is really the only user of these
      functions.  This moves all the copy code over to migrate.c and makes
      copy_huge_page() work for THP by checking for it explicitly.
      
      I believe the bug was introduced in commit b32967ff
      
       ("mm: numa: Add
      THP migration for the NUMA working set scanning fault case")
      
      [akpm@linux-foundation.org: fix coding-style and comment text, per Naoya Horiguchi]
      Signed-off-by: default avatarDave Hansen <dave.hansen@linux.intel.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Tested-by: default avatarDave Jiang <dave.jiang@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      30b0a105
  10. 15 Nov, 2013 1 commit
    • Kirill A. Shutemov's avatar
      mm, hugetlb: convert hugetlbfs to use split pmd lock · cb900f41
      Kirill A. Shutemov authored
      
      
      Hugetlb supports multiple page sizes. We use split lock only for PMD
      level, but not for PUD.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: default avatarAlex Thorlton <athorlton@sgi.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Robin Holt <robinmholt@gmail.com>
      Cc: Sedat Dilek <sedat.dilek@gmail.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cb900f41
  11. 11 Sep, 2013 4 commits
    • Naoya Horiguchi's avatar
      mm: migrate: check movability of hugepage in unmap_and_move_huge_page() · 83467efb
      Naoya Horiguchi authored
      
      
      Currently hugepage migration works well only for pmd-based hugepages
      (mainly due to lack of testing,) so we had better not enable migration of
      other levels of hugepages until we are ready for it.
      
      Some users of hugepage migration (mbind, move_pages, and migrate_pages) do
      page table walk and check pud/pmd_huge() there, so they are safe.  But the
      other users (softoffline and memory hotremove) don't do this, so without
      this patch they can try to migrate unexpected types of hugepages.
      
      To prevent this, we introduce hugepage_migration_support() as an
      architecture dependent check of whether hugepage are implemented on a pmd
      basis or not.  And on some architecture multiple sizes of hugepages are
      available, so hugepage_migration_support() also checks hugepage size.
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      83467efb
    • Naoya Horiguchi's avatar
      mm: memory-hotplug: enable memory hotplug to handle hugepage · c8721bbb
      Naoya Horiguchi authored
      
      
      Until now we can't offline memory blocks which contain hugepages because a
      hugepage is considered as an unmovable page.  But now with this patch
      series, a hugepage has become movable, so by using hugepage migration we
      can offline such memory blocks.
      
      What's different from other users of hugepage migration is that we need to
      decompose all the hugepages inside the target memory block into free buddy
      pages after hugepage migration, because otherwise free hugepages remaining
      in the memory block intervene the memory offlining.  For this reason we
      introduce new functions dissolve_free_huge_page() and
      dissolve_free_huge_pages().
      
      Other than that, what this patch does is straightforwardly to add hugepage
      migration code, that is, adding hugepage code to the functions which scan
      over pfn and collect hugepages to be migrated, and adding a hugepage
      allocation function to alloc_migrate_target().
      
      As for larger hugepages (1GB for x86_64), it's not easy to do hotremove
      over them because it's larger than memory block.  So we now simply leave
      it to fail as it is.
      
      [yongjun_wei@trendmicro.com.cn: remove duplicated include]
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: default avatarAndi Kleen <ak@linux.intel.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: default avatarWei Yongjun <yongjun_wei@trendmicro.com.cn>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c8721bbb
    • Naoya Horiguchi's avatar
      mm: mbind: add hugepage migration code to mbind() · 74060e4d
      Naoya Horiguchi authored
      
      
      Extend do_mbind() to handle vma with VM_HUGETLB set.  We will be able to
      migrate hugepage with mbind(2) after applying the enablement patch which
      comes later in this series.
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: default avatarAndi Kleen <ak@linux.intel.com>
      Reviewed-by: default avatarWanpeng Li <liwanp@linux.vnet.ibm.com>
      Acked-by: default avatarHillf Danton <dhillf@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      74060e4d
    • Naoya Horiguchi's avatar
      mm: migrate: make core migration code aware of hugepage · 31caf665
      Naoya Horiguchi authored
      Currently hugepage migration is available only for soft offlining, but
      it's also useful for some other users of page migration (clearly because
      users of hugepage can enjoy the benefit of mempolicy and memory hotplug.)
      So this patchset tries to extend such users to support hugepage migration.
      
      The target of this patchset is to enable hugepage migration for NUMA
      related system calls (migrate_pages(2), move_pages(2), and mbind(2)), and
      memory hotplug.
      
      This patchset does not add hugepage migration for memory compaction,
      because users of memory compaction mainly expect to construct thp by
      arranging raw pages, and there's little or no need to compact hugepages.
      CMA, another user of page migration, can have benefit from hugepage
      migration, but is not enabled to support it for now (just because of lack
      of testing and expertise in CMA.)
      
      Hugepage migration of non pmd-based hugepage (for example 1GB hugepage in
      x86_64, or hugepages in architectures like ia64) is not enabled for now
      (again, because of lack of testing.)
      
      As for how these are achived, I extended the API (migrate_pages()) to
      handle hugepage (with patch 1 and 2) and adjusted code of each caller to
      check and collect movable hugepages (with patch 3-7).  Remaining 2 patches
      are kind of miscellaneous ones to avoid unexpected behavior.  Patch 8 is
      about making sure that we only migrate pmd-based hugepages.  And patch 9
      is about choosing appropriate zone for hugepage allocation.
      
      My test is mainly functional one, simply kicking hugepage migration via
      each entry point and confirm that migration is done correctly.  Test code
      is available here:
      
        git://github.com/Naoya-Horiguchi/test_hugepage_migration_extension.git
      
      
      
      And I always run libhugetlbfs test when changing hugetlbfs's code.  With
      this patchset, no regression was found in the test.
      
      This patch (of 9):
      
      Before enabling each user of page migration to support hugepage,
      this patch enables the list of pages for migration to link not only
      LRU pages, but also hugepages. As a result, putback_movable_pages()
      and migrate_pages() can handle both of LRU pages and hugepages.
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: default avatarAndi Kleen <ak@linux.intel.com>
      Reviewed-by: default avatarWanpeng Li <liwanp@linux.vnet.ibm.com>
      Acked-by: default avatarHillf Danton <dhillf@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      31caf665
  12. 03 Jul, 2013 1 commit
  13. 25 Jun, 2013 1 commit
    • Zhang Yi's avatar
      futex: Take hugepages into account when generating futex_key · 13d60f4b
      Zhang Yi authored
      
      
      The futex_keys of process shared futexes are generated from the page
      offset, the mapping host and the mapping index of the futex user space
      address. This should result in an unique identifier for each futex.
      
      Though this is not true when futexes are located in different subpages
      of an hugepage. The reason is, that the mapping index for all those
      futexes evaluates to the index of the base page of the hugetlbfs
      mapping. So a futex at offset 0 of the hugepage mapping and another
      one at offset PAGE_SIZE of the same hugepage mapping have identical
      futex_keys. This happens because the futex code blindly uses
      page->index.
      
      Steps to reproduce the bug:
      
      1. Map a file from hugetlbfs. Initialize pthread_mutex1 at offset 0
         and pthread_mutex2 at offset PAGE_SIZE of the hugetlbfs
         mapping.
      
         The mutexes must be initialized as PTHREAD_PROCESS_SHARED because
         PTHREAD_PROCESS_PRIVATE mutexes are not affected by this issue as
         their keys solely depend on the user space address.
      
      2. Lock mutex1 and mutex2
      
      3. Create thread1 and in the thread function lock mutex1, which
         results in thread1 blocking on the locked mutex1.
      
      4. Create thread2 and in the thread function lock mutex2, which
         results in thread2 blocking on the locked mutex2.
      
      5. Unlock mutex2. Despite the fact that mutex2 got unlocked, thread2
         still blocks on mutex2 because the futex_key points to mutex1.
      
      To solve this issue we need to take the normal page index of the page
      which contains the futex into account, if the futex is in an hugetlbfs
      mapping. In other words, we calculate the normal page mapping index of
      the subpage in the hugetlbfs mapping.
      
      Mappings which are not based on hugetlbfs are not affected and still
      use page->index.
      
      Thanks to Mel Gorman who provided a patch for adding proper evaluation
      functions to the hugetlbfs code to avoid exposing hugetlbfs specific
      details to the futex code.
      
      [ tglx: Massaged changelog ]
      Signed-off-by: default avatarZhang Yi <zhang.yi20@zte.com.cn>
      Reviewed-by: default avatarJiang Biao <jiang.biao2@zte.com.cn>
      Tested-by: default avatarMa Chenggong <ma.chenggong@zte.com.cn>
      Reviewed-by: default avatar'Mel Gorman' <mgorman@suse.de>
      Acked-by: default avatar'Darren Hart' <dvhart@linux.intel.com>
      Cc: 'Peter Zijlstra' <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/000101ce71a6%24a83c5880%24f8b50980%24@com
      
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      13d60f4b
  14. 14 Jun, 2013 1 commit
  15. 08 May, 2013 1 commit
  16. 29 Apr, 2013 1 commit
    • David Rientjes's avatar
      mm, hugetlb: include hugepages in meminfo · 949f7ec5
      David Rientjes authored
      
      
      Particularly in oom conditions, it's troublesome that hugetlb memory is
      not displayed.  All other meminfo that is emitted will not add up to
      what is expected, and there is no artifact left in the kernel log to
      show that a potentially significant amount of memory is actually
      allocated as hugepages which are not available to be reclaimed.
      
      Booting with hugepages=8192 on the command line, this memory is now
      shown in oom conditions.  For example, with echo m >
      /proc/sysrq-trigger:
      
        Node 0 hugepages_total=2048 hugepages_free=2048 hugepages_surp=0 hugepages_size=2048kB
        Node 1 hugepages_total=2048 hugepages_free=2048 hugepages_surp=0 hugepages_size=2048kB
        Node 2 hugepages_total=2048 hugepages_free=2048 hugepages_surp=0 hugepages_size=2048kB
        Node 3 hugepages_total=2048 hugepages_free=2048 hugepages_surp=0 hugepages_size=2048kB
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      949f7ec5
  17. 24 Feb, 2013 1 commit
  18. 23 Feb, 2013 1 commit
  19. 12 Dec, 2012 1 commit
    • Andi Kleen's avatar
      mm: support more pagesizes for MAP_HUGETLB/SHM_HUGETLB · 42d7395f
      Andi Kleen authored
      
      
      There was some desire in large applications using MAP_HUGETLB or
      SHM_HUGETLB to use 1GB huge pages on some mappings, and stay with 2MB on
      others.  This is useful together with NUMA policy: use 2MB interleaving
      on some mappings, but 1GB on local mappings.
      
      This patch extends the IPC/SHM syscall interfaces slightly to allow
      specifying the page size.
      
      It borrows some upper bits in the existing flag arguments and allows
      encoding the log of the desired page size in addition to the *_HUGETLB
      flag.  When 0 is specified the default size is used, this makes the
      change fully compatible.
      
      Extending the internal hugetlb code to handle this is straight forward.
      Instead of a single mount it just keeps an array of them and selects the
      right mount based on the specified page size.  When no page size is
      specified it uses the mount of the default page size.
      
      The change is not visible in /proc/mounts because internal mounts don't
      appear there.  It also has very little overhead: the additional mounts
      just consume a super block, but not more memory when not used.
      
      I also exported the new flags to the user headers (they were previously
      under __KERNEL__).  Right now only symbols for x86 and some other
      architecture for 1GB and 2MB are defined.  The interface should already
      work for all other architectures though.  Only architectures that define
      multiple hugetlb sizes actually need it (that is currently x86, tile,
      powerpc).  However tile and powerpc have user configurable hugetlb
      sizes, so it's not easy to add defines.  A program on those
      architectures would need to query sysfs and use the appropiate log2.
      
      [akpm@linux-foundation.org: cleanups]
      [rientjes@google.com: fix build]
      [akpm@linux-foundation.org: checkpatch fixes]
      Signed-off-by: default avatarAndi Kleen <ak@linux.intel.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      42d7395f
  20. 11 Dec, 2012 1 commit
  21. 01 Aug, 2012 6 commits
    • Mel Gorman's avatar
      mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables · d833352a
      Mel Gorman authored
      If a process creates a large hugetlbfs mapping that is eligible for page
      table sharing and forks heavily with children some of whom fault and
      others which destroy the mapping then it is possible for page tables to
      get corrupted.  Some teardowns of the mapping encounter a "bad pmd" and
      output a message to the kernel log.  The final teardown will trigger a
      BUG_ON in mm/filemap.c.
      
      This was reproduced in 3.4 but is known to have existed for a long time
      and goes back at least as far as 2.6.37.  It was probably was introduced
      in 2.6.20 by [39dde65c
      
      : shared page table for hugetlb page].  The messages
      look like this;
      
      [  ..........] Lots of bad pmd messages followed by this
      [  127.164256] mm/memory.c:391: bad pmd ffff880412e04fe8(80000003de4000e7).
      [  127.164257] mm/memory.c:391: bad pmd ffff880412e04ff0(80000003de6000e7).
      [  127.164258] mm/memory.c:391: bad pmd ffff880412e04ff8(80000003de0000e7).
      [  127.186778] ------------[ cut here ]------------
      [  127.186781] kernel BUG at mm/filemap.c:134!
      [  127.186782] invalid opcode: 0000 [#1] SMP
      [  127.186783] CPU 7
      [  127.186784] Modules linked in: af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801 r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos usb_common button i2c_core intel_agp video intel_gtt fan processor thermal thermal_sys hwmon ata_generic pata_atiixp libata scsi_mod
      [  127.186801]
      [  127.186802] Pid: 9017, comm: hugetlbfs-test Not tainted 3.4.0-autobuild #53 Dell Inc. OptiPlex 990/06D7TR
      [  127.186804] RIP: 0010:[<ffffffff810ed6ce>]  [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
      [  127.186809] RSP: 0000:ffff8804144b5c08  EFLAGS: 00010002
      [  127.186810] RAX: 0000000000000001 RBX: ffffea000a5c9000 RCX: 00000000ffffffc0
      [  127.186811] RDX: 0000000000000000 RSI: 0000000000000009 RDI: ffff88042dfdad00
      [  127.186812] RBP: ffff8804144b5c18 R08: 0000000000000009 R09: 0000000000000003
      [  127.186813] R10: 0000000000000000 R11: 000000000000002d R12: ffff880412ff83d8
      [  127.186814] R13: ffff880412ff83d8 R14: 0000000000000000 R15: ffff880412ff83d8
      [  127.186815] FS:  00007fe18ed2c700(0000) GS:ffff88042dce0000(0000) knlGS:0000000000000000
      [  127.186816] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [  127.186817] CR2: 00007fe340000503 CR3: 0000000417a14000 CR4: 00000000000407e0
      [  127.186818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  127.186819] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [  127.186820] Process hugetlbfs-test (pid: 9017, threadinfo ffff8804144b4000, task ffff880417f803c0)
      [  127.186821] Stack:
      [  127.186822]  ffffea000a5c9000 0000000000000000 ffff8804144b5c48 ffffffff810ed83b
      [  127.186824]  ffff8804144b5c48 000000000000138a 0000000000001387 ffff8804144b5c98
      [  127.186825]  ffff8804144b5d48 ffffffff811bc925 ffff8804144b5cb8 0000000000000000
      [  127.186827] Call Trace:
      [  127.186829]  [<ffffffff810ed83b>] delete_from_page_cache+0x3b/0x80
      [  127.186832]  [<ffffffff811bc925>] truncate_hugepages+0x115/0x220
      [  127.186834]  [<ffffffff811bca43>] hugetlbfs_evict_inode+0x13/0x30
      [  127.186837]  [<ffffffff811655c7>] evict+0xa7/0x1b0
      [  127.186839]  [<ffffffff811657a3>] iput_final+0xd3/0x1f0
      [  127.186840]  [<ffffffff811658f9>] iput+0x39/0x50
      [  127.186842]  [<ffffffff81162708>] d_kill+0xf8/0x130
      [  127.186843]  [<ffffffff81162812>] dput+0xd2/0x1a0
      [  127.186845]  [<ffffffff8114e2d0>] __fput+0x170/0x230
      [  127.186848]  [<ffffffff81236e0e>] ? rb_erase+0xce/0x150
      [  127.186849]  [<ffffffff8114e3ad>] fput+0x1d/0x30
      [  127.186851]  [<ffffffff81117db7>] remove_vma+0x37/0x80
      [  127.186853]  [<ffffffff81119182>] do_munmap+0x2d2/0x360
      [  127.186855]  [<ffffffff811cc639>] sys_shmdt+0xc9/0x170
      [  127.186857]  [<ffffffff81410a39>] system_call_fastpath+0x16/0x1b
      [  127.186858] Code: 0f 1f 44 00 00 48 8b 43 08 48 8b 00 48 8b 40 28 8b b0 40 03 00 00 85 f6 0f 88 df fe ff ff 48 89 df e8 e7 cb 05 00 e9 d2 fe ff ff <0f> 0b 55 83 e2 fd 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 65 e0
      [  127.186868] RIP  [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
      [  127.186870]  RSP <ffff8804144b5c08>
      [  127.186871] ---[ end trace 7cbac5d1db69f426 ]---
      
      The bug is a race and not always easy to reproduce.  To reproduce it I was
      doing the following on a single socket I7-based machine with 16G of RAM.
      
      $ hugeadm --pool-pages-max DEFAULT:13G
      $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmmax
      $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmall
      $ for i in `seq 1 9000`; do ./hugetlbfs-test; done
      
      On my particular machine, it usually triggers within 10 minutes but
      enabling debug options can change the timing such that it never hits.
      Once the bug is triggered, the machine is in trouble and needs to be
      rebooted.  The machine will respond but processes accessing proc like "ps
      aux" will hang due to the BUG_ON.  shutdown will also hang and needs a
      hard reset or a sysrq-b.
      
      The basic problem is a race between page table sharing and teardown.  For
      the most part page table sharing depends on i_mmap_mutex.  In some cases,
      it is also taking the mm->page_table_lock for the PTE updates but with
      shared page tables, it is the i_mmap_mutex that is more important.
      
      Unfortunately it appears to be also insufficient. Consider the following
      situation
      
      Process A					Process B
      ---------					---------
      hugetlb_fault					shmdt
        						LockWrite(mmap_sem)
          						  do_munmap
      						    unmap_region
      						      unmap_vmas
      						        unmap_single_vma
      						          unmap_hugepage_range
            						            Lock(i_mmap_mutex)
      							    Lock(mm->page_table_lock)
      							    huge_pmd_unshare/unmap tables <--- (1)
      							    Unlock(mm->page_table_lock)
            						            Unlock(i_mmap_mutex)
        huge_pte_alloc				      ...
          Lock(i_mmap_mutex)				      ...
          vma_prio_walk, find svma, spte		      ...
          Lock(mm->page_table_lock)			      ...
          share spte					      ...
          Unlock(mm->page_table_lock)			      ...
          Unlock(i_mmap_mutex)			      ...
        hugetlb_no_page									  <--- (2)
      						      free_pgtables
      						        unlink_file_vma
      							hugetlb_free_pgd_range
      						    remove_vma_list
      
      In this scenario, it is possible for Process A to share page tables with
      Process B that is trying to tear them down.  The i_mmap_mutex on its own
      does not prevent Process A walking Process B's page tables.  At (1) above,
      the page tables are not shared yet so it unmaps the PMDs.  Process A sets
      up page table sharing and at (2) faults a new entry.  Process B then trips
      up on it in free_pgtables.
      
      This patch fixes the problem by adding a new function
      __unmap_hugepage_range_final that is only called when the VMA is about to
      be destroyed.  This function clears VM_MAYSHARE during
      unmap_hugepage_range() under the i_mmap_mutex.  This makes the VMA
      ineligible for sharing and avoids the race.  Superficially this looks like
      it would then be vunerable to truncate and madvise issues but hugetlbfs
      has its own truncate handlers so does not use unmap_mapping_range() and
      does not support madvise(DONTNEED).
      
      This should be treated as a -stable candidate if it is merged.
      
      Test program is as follows. The test case was mostly written by Michal
      Hocko with a few minor changes to reproduce this bug.
      
      ==== CUT HERE ====
      
      static size_t huge_page_size = (2UL << 20);
      static size_t nr_huge_page_A = 512;
      static size_t nr_huge_page_B = 5632;
      
      unsigned int get_random(unsigned int max)
      {
      	struct timeval tv;
      
      	gettimeofday(&tv, NULL);
      	srandom(tv.tv_usec);
      	return random() % max;
      }
      
      static void play(void *addr, size_t size)
      {
      	unsigned char *start = addr,
      		      *end = start + size,
      		      *a;
      	start += get_random(size/2);
      
      	/* we could itterate on huge pages but let's give it more time. */
      	for (a = start; a < end; a += 4096)
      		*a = 0;
      }
      
      int main(int argc, char **argv)
      {
      	key_t key = IPC_PRIVATE;
      	size_t sizeA = nr_huge_page_A * huge_page_size;
      	size_t sizeB = nr_huge_page_B * huge_page_size;
      	int shmidA, shmidB;
      	void *addrA = NULL, *addrB = NULL;
      	int nr_children = 300, n = 0;
      
      	if ((shmidA = shmget(key, sizeA, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
      		perror("shmget:");
      		return 1;
      	}
      
      	if ((addrA = shmat(shmidA, addrA, SHM_R|SHM_W)) == (void *)-1UL) {
      		perror("shmat");
      		return 1;
      	}
      	if ((shmidB = shmget(key, sizeB, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
      		perror("shmget:");
      		return 1;
      	}
      
      	if ((addrB = shmat(shmidB, addrB, SHM_R|SHM_W)) == (void *)-1UL) {
      		perror("shmat");
      		return 1;
      	}
      
      fork_child:
      	switch(fork()) {
      		case 0:
      			switch (n%3) {
      			case 0:
      				play(addrA, sizeA);
      				break;
      			case 1:
      				play(addrB, sizeB);
      				break;
      			case 2:
      				break;
      			}
      			break;
      		case -1:
      			perror("fork:");
      			break;
      		default:
      			if (++n < nr_children)
      				goto fork_child;
      			play(addrA, sizeA);
      			break;
      	}
      	shmdt(addrA);
      	shmdt(addrB);
      	do {
      		wait(NULL);
      	} while (--n > 0);
      	shmctl(shmidA, IPC_RMID, NULL);
      	shmctl(shmidB, IPC_RMID, NULL);
      	return 0;
      }
      
      [akpm@linux-foundation.org: name the declaration's args, fix CONFIG_HUGETLBFS=n build]
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d833352a
    • Aneesh Kumar K.V's avatar
      hugetlb/cgroup: add hugetlb cgroup control files · abb8206c
      Aneesh Kumar K.V authored
      
      
      Add the control files for hugetlb controller
      
      [akpm@linux-foundation.org: s/CONFIG_CGROUP_HUGETLB_RES_CTLR/CONFIG_MEMCG_HUGETLB/g]
      [akpm@linux-foundation.org: s/CONFIG_MEMCG_HUGETLB/CONFIG_CGROUP_HUGETLB/]
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      abb8206c
    • Aneesh Kumar K.V's avatar
      hugetlb: make some static variables global · c3f38a38
      Aneesh Kumar K.V authored
      
      
      We will use them later in hugetlb_cgroup.c
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c3f38a38
    • Aneesh Kumar K.V's avatar
      hugetlb: add a list for tracking in-use HugeTLB pages · 0edaecfa
      Aneesh Kumar K.V authored
      
      
      hugepage_activelist will be used to track currently used HugeTLB pages.
      We need to find the in-use HugeTLB pages to support HugeTLB cgroup removal.
      On cgroup removal we update the page's HugeTLB cgroup to point to parent
      cgroup.
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0edaecfa
    • Aneesh Kumar K.V's avatar
      hugetlb: use mmu_gather instead of a temporary linked list for accumulating pages · 24669e58
      Aneesh Kumar K.V authored
      
      
      Use a mmu_gather instead of a temporary linked list for accumulating pages
      when we unmap a hugepage range
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      24669e58
    • Aneesh Kumar K.V's avatar
      hugetlb: add an inline helper for finding hstate index · 972dc4de
      Aneesh Kumar K.V authored
      
      
      Add an inline helper and use it in the code.
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      972dc4de
  22. 25 May, 2012 1 commit
  23. 22 Mar, 2012 3 commits
    • Steven Truelove's avatar
      hugetlbfs: fix alignment of huge page requests · 40716e29
      Steven Truelove authored
      
      
      When calling shmget() with SHM_HUGETLB, shmget aligns the request size to
      PAGE_SIZE, but this is not sufficient.
      
      Modify hugetlb_file_setup() to align requests to the huge page size, and
      to accept an address argument so that all alignment checks can be
      performed in hugetlb_file_setup(), rather than in its callers.  Change
      newseg() and mmap_pgoff() to match the new prototype and eliminate a now
      redundant alignment check.
      
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: default avatarSteven Truelove <steven.truelove@utoronto.ca>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      40716e29
    • David Gibson's avatar
      hugepages: fix use after free bug in "quota" handling · 90481622
      David Gibson authored
      hugetlbfs_{get,put}_quota() are badly named.  They don't interact with the
      general quota handling code, and they don't much resemble its behaviour.
      Rather than being about maintaining limits on on-disk block usage by
      particular users, they are instead about maintaining limits on in-memory
      page usage (including anonymous MAP_PRIVATE copied-on-write pages)
      associated with a particular hugetlbfs filesystem instance.
      
      Worse, they work by having callbacks to the hugetlbfs filesystem code from
      the low-level page handling code, in particular from free_huge_page().
      This is a layering violation of itself, but more importantly, if the
      kernel does a get_user_pages() on hugepages (which can happen from KVM
      amongst others), then the free_huge_page() can be delayed until after the
      associated inode has already been freed.  If an unmount occurs at the
      wrong time, even the hugetlbfs superblock where the "quota" limits are
      stored may have been freed.
      
      Andrew Barry proposed a patch to fix this by having hugepages, instead of
      storing a pointer to their address_space and reaching the superblock from
      there, had the hugepages store pointers directly to the superblock,
      bumping the reference count as appropriate to avoid it being freed.
      Andrew Morton rejected that version, however, on the grounds that it made
      the existing layering violation worse.
      
      This is a reworked version of Andrew's patch, which removes the extra, and
      some of the existing, layering violation.  It works by introducing the
      concept of a hugepage "subpool" at the lower hugepage mm layer - that is a
      finite logical pool of hugepages to allocate from.  hugetlbfs now creates
      a subpool for each filesystem instance with a page limit set, and a
      pointer to the subpool gets added to each allocated hugepage, instead of
      the address_space pointer used now.  The subpool has its own lifetime and
      is only freed once all pages in it _and_ all other references to it (i.e.
      superblocks) are gone.
      
      subpools are optional - a NULL subpool pointer is taken by the code to
      mean that no subpool limits are in effect.
      
      Previous discussion of this bug found in:  "Fix refcounting in hugetlbfs
      quota handling.". See:  https://lkml.org/lkml/2011/8/11/28 or
      http://marc.info/?l=linux-mm&m=126928970510627&w=1
      
      
      
      v2: Fixed a bug spotted by Hillf Danton, and removed the extra parameter to
      alloc_huge_page() - since it already takes the vma, it is not necessary.
      Signed-off-by: default avatarAndrew Barry <abarry@cray.com>
      Signed-off-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      90481622
    • David Gibson's avatar
      hugetlb: cleanup hugetlb.h · a1d776ee
      David Gibson authored
      
      
      Make a couple of small cleanups to linux/include/hugetlb.h.  The
      set_file_hugepages() function, which was not used anywhere is removed,
      and the hugetlbfs_config and hugetlbfs_inode_info structures with its
      HUGETLBFS_I helper function are moved into inode.c, the only place they
      were used.
      
      These structures are really linked to the hugetlbfs filesystem
      specifically not to hugepage mm handling in general, so they belong in
      the filesystem code not in a generally available header.
      
      It would be nice to move the hugetlbfs_sb_info (superblock) structure in
      there as well, but it's currently needed in a number of places via the
      hstate_vma() and hstate_inode().
      Signed-off-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Andrew Barry <abarry@cray.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a1d776ee
  24. 19 Nov, 2011 1 commit
    • David Rientjes's avatar
      hugetlb: remove dummy definitions of HPAGE_MASK and HPAGE_SIZE · a5c86e98
      David Rientjes authored
      Dummy, non-zero definitions for HPAGE_MASK and HPAGE_SIZE were added in
      51c6f666
      
       ("mm: ZAP_BLOCK causes redundant work") to avoid a divide
      by zero in generic kernel code.
      
      That code has since been removed, but probably should never have been
      added in the first place: we don't want HPAGE_SIZE to act like PAGE_SIZE
      for code that is working with hugepages, for example, when the
      dependency on CONFIG_HUGETLB_PAGE has not been fulfilled.
      
      Because hugepage size can differ from architecture to architecture, each
      is required to have their own definitions for both HPAGE_MASK and
      HPAGE_SIZE.  This is always done in arch/*/include/asm/page.h.
      
      So, just remove the dummy and dangerous definitions since they are no
      longer needed and reveals the correct dependencies.  Tested on
      architectures using the definitions with allyesconfig: x86 (even with
      thp), hppa, mips, powerpc, s390, sh3, sh4, sparc, and sparc64, and with
      defconfig on ia64.
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a5c86e98
  25. 26 Jul, 2011 1 commit