1. 14 Jun, 2013 1 commit
  2. 29 Apr, 2013 2 commits
    • David Rientjes's avatar
      mm, hugetlb: include hugepages in meminfo · 949f7ec5
      David Rientjes authored
      
      
      Particularly in oom conditions, it's troublesome that hugetlb memory is
      not displayed.  All other meminfo that is emitted will not add up to
      what is expected, and there is no artifact left in the kernel log to
      show that a potentially significant amount of memory is actually
      allocated as hugepages which are not available to be reclaimed.
      
      Booting with hugepages=8192 on the command line, this memory is now
      shown in oom conditions.  For example, with echo m >
      /proc/sysrq-trigger:
      
        Node 0 hugepages_total=2048 hugepages_free=2048 hugepages_surp=0 hugepages_size=2048kB
        Node 1 hugepages_total=2048 hugepages_free=2048 hugepages_surp=0 hugepages_size=2048kB
        Node 2 hugepages_total=2048 hugepages_free=2048 hugepages_surp=0 hugepages_size=2048kB
        Node 3 hugepages_total=2048 hugepages_free=2048 hugepages_surp=0 hugepages_size=2048kB
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      949f7ec5
    • Gerald Schaefer's avatar
      mm/hugetlb: add more arch-defined huge_pte functions · 106c992a
      Gerald Schaefer authored
      Commit abf09bed
      
       ("s390/mm: implement software dirty bits")
      introduced another difference in the pte layout vs.  the pmd layout on
      s390, thoroughly breaking the s390 support for hugetlbfs.  This requires
      replacing some more pte_xxx functions in mm/hugetlbfs.c with a
      huge_pte_xxx version.
      
      This patch introduces those huge_pte_xxx functions and their generic
      implementation in asm-generic/hugetlb.h, which will now be included on
      all architectures supporting hugetlbfs apart from s390.  This change
      will be a no-op for those architectures.
      
      [akpm@linux-foundation.org: fix warning]
      Signed-off-by: default avatarGerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>	[for !s390 parts]
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      106c992a
  3. 17 Apr, 2013 1 commit
    • Naoya Horiguchi's avatar
      hugetlbfs: add swap entry check in follow_hugetlb_page() · 9cc3a5bd
      Naoya Horiguchi authored
      
      
      With applying the previous patch "hugetlbfs: stop setting VM_DONTDUMP in
      initializing vma(VM_HUGETLB)" to reenable hugepage coredump, if a memory
      error happens on a hugepage and the affected processes try to access the
      error hugepage, we hit VM_BUG_ON(atomic_read(&page->_count) <= 0) in
      get_page().
      
      The reason for this bug is that coredump-related code doesn't recognise
      "hugepage hwpoison entry" with which a pmd entry is replaced when a memory
      error occurs on a hugepage.
      
      In other words, physical address information is stored in different bit
      layout between hugepage hwpoison entry and pmd entry, so
      follow_hugetlb_page() which is called in get_dump_page() returns a wrong
      page from a given address.
      
      The expected behavior is like this:
      
        absent   is_swap_pte   FOLL_DUMP   Expected behavior
        -------------------------------------------------------------------
         true     false         false       hugetlb_fault
         false    true          false       hugetlb_fault
         false    false         false       return page
         true     false         true        skip page (to avoid allocation)
         false    true          true        hugetlb_fault
         false    false         true        return page
      
      With this patch, we can call hugetlb_fault() and take proper actions (we
      wait for migration entries, fail with VM_FAULT_HWPOISON_LARGE for
      hwpoisoned entries,) and as the result we can dump all hugepages except
      for hwpoisoned ones.
      
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Acked-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>	[2.6.34+?]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9cc3a5bd
  4. 22 Mar, 2013 1 commit
    • Wanpeng Li's avatar
      mm/hugetlb: fix total hugetlbfs pages count when using memory overcommit accouting · d0028588
      Wanpeng Li authored
      hugetlb_total_pages is used for overcommit calculations but the current
      implementation considers only the default hugetlb page size (which is
      either the first defined hugepage size or the one specified by
      default_hugepagesz kernel boot parameter).
      
      If the system is configured for more than one hugepage size, which is
      possible since commit a137e1cc
      
       ("hugetlbfs: per mount huge page
      sizes") then the overcommit estimation done by __vm_enough_memory()
      (resp.  shown by meminfo_proc_show) is not precise - there is an
      impression of more available/allowed memory.  This can lead to an
      unexpected ENOMEM/EFAULT resp.  SIGSEGV when memory is accounted.
      
      Testcase:
        boot: hugepagesz=1G hugepages=1
        the default overcommit ratio is 50
        before patch:
      
          egrep 'CommitLimit' /proc/meminfo
          CommitLimit:     55434168 kB
      
        after patch:
      
          egrep 'CommitLimit' /proc/meminfo
          CommitLimit:     54909880 kB
      
      [akpm@linux-foundation.org: coding-style tweak]
      Signed-off-by: default avatarWanpeng Li <liwanp@linux.vnet.ibm.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: <stable@vger.kernel.org>		[3.0+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d0028588
  5. 07 Mar, 2013 1 commit
  6. 24 Feb, 2013 2 commits
  7. 23 Feb, 2013 1 commit
  8. 05 Feb, 2013 1 commit
  9. 18 Dec, 2012 1 commit
  10. 13 Dec, 2012 3 commits
  11. 12 Dec, 2012 1 commit
  12. 11 Dec, 2012 1 commit
  13. 06 Dec, 2012 1 commit
  14. 09 Oct, 2012 6 commits
    • Andrew Morton's avatar
      mm: document PageHuge somewhat · 7795912c
      Andrew Morton authored
      
      
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7795912c
    • Sagi Grimberg's avatar
      mm: move all mmu notifier invocations to be done outside the PT lock · 2ec74c3e
      Sagi Grimberg authored
      In order to allow sleeping during mmu notifier calls, we need to avoid
      invoking them under the page table spinlock.  This patch solves the
      problem by calling invalidate_page notification after releasing the lock
      (but before freeing the page itself), or by wrapping the page invalidation
      with calls to invalidate_range_begin and invalidate_range_end.
      
      To prevent accidental changes to the invalidate_range_end arguments after
      the call to invalidate_range_begin, the patch introduces a convention of
      saving the arguments in consistently named locals:
      
      	unsigned long mmun_start;	/* For mmu_notifiers */
      	unsigned long mmun_end;	/* For mmu_notifiers */
      
      	...
      
      	mmun_start = ...
      	mmun_end = ...
      	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
      
      	...
      
      	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
      
      The patch changes code to use this convention for all calls to
      mmu_notifier_invalidate_range_start/end, except those where the calls are
      close enough so that anyone who glances at the code can see the values
      aren't changing.
      
      This patchset is a preliminary step towards on-demand paging design to be
      added to the RDMA stack.
      
      Why do we want on-demand paging for Infiniband?
      
        Applications register memory with an RDMA adapter using system calls,
        and subsequently post IO operations that refer to the corresponding
        virtual addresses directly to HW.  Until now, this was achieved by
        pinning the memory during the registration calls.  The goal of on demand
        paging is to avoid pinning the pages of registered memory regions (MRs).
         This will allow users the same flexibility they get when swapping any
        other part of their processes address spaces.  Instead of requiring the
        entire MR to fit in physical memory, we can allow the MR to be larger,
        and only fit the current working set in physical memory.
      
      Why should anyone care?  What problems are users currently experiencing?
      
        This can make programming with RDMA much simpler.  Today, developers
        that are working with more data than their RAM can hold need either to
        deregister and reregister memory regions throughout their process's
        life, or keep a single memory region and copy the data to it.  On demand
        paging will allow these developers to register a single MR at the
        beginning of their process's life, and let the operating system manage
        which pages needs to be fetched at a given time.  In the future, we
        might be able to provide a single memory access key for each process
        that would provide the entire process's address as one large memory
        region, and the developers wouldn't need to register memory regions at
        all.
      
      Is there any prospect that any other subsystems will utilise these
      infrastructural changes?  If so, which and how, etc?
      
        As for other subsystems, I understand that XPMEM wanted to sleep in
        MMU notifiers, as Christoph Lameter wrote at
        http://lkml.indiana.edu/hypermail/linux/kernel/0802.1/0460.html
      
       and
        perhaps Andrea knows about other use cases.
      
        Scheduling in mmu notifications is required since we need to sync the
        hardware with the secondary page tables change.  A TLB flush of an IO
        device is inherently slower than a CPU TLB flush, so our design works by
        sending the invalidation request to the device, and waiting for an
        interrupt before exiting the mmu notifier handler.
      
      Avi said:
      
        kvm may be a buyer.  kvm::mmu_lock, which serializes guest page
        faults, also protects long operations such as destroying large ranges.
        It would be good to convert it into a spinlock, but as it is used inside
        mmu notifiers, this cannot be done.
      
        (there are alternatives, such as keeping the spinlock and using a
        generation counter to do the teardown in O(1), which is what the "may"
        is doing up there).
      
      [akpm@linux-foundation.orgpossible speed tweak in hugetlb_cow(), cleanups]
      Signed-off-by: default avatarAndrea Arcangeli <andrea@qumranet.com>
      Signed-off-by: default avatarSagi Grimberg <sagig@mellanox.com>
      Signed-off-by: default avatarHaggai Eran <haggaie@mellanox.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Or Gerlitz <ogerlitz@mellanox.com>
      Cc: Haggai Eran <haggaie@mellanox.com>
      Cc: Shachar Raindel <raindel@mellanox.com>
      Cc: Liran Liss <liranl@mellanox.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2ec74c3e
    • Michal Hocko's avatar
      hugetlb: do not use vma_hugecache_offset() for vma_prio_tree_foreach · 36e4f20a
      Michal Hocko authored
      Commit 0c176d52
      
       ("mm: hugetlb: fix pgoff computation when unmapping
      page from vma") fixed pgoff calculation but it has replaced it by
      vma_hugecache_offset() which is not approapriate for offsets used for
      vma_prio_tree_foreach() because that one expects index in page units
      rather than in huge_page_shift.
      
      Johannes said:
      
      : The resulting index may not be too big, but it can be too small: assume
      : hpage size of 2M and the address to unmap to be 0x200000.  This is regular
      : page index 512 and hpage index 1.  If you have a VMA that maps the file
      : only starting at the second huge page, that VMAs vm_pgoff will be 512 but
      : you ask for offset 1 and miss it even though it does map the page of
      : interest.  hugetlb_cow() will try to unmap, miss the vma, and retry the
      : cow until the allocation succeeds or the skipped vma(s) go away.
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Acked-by: default avatarHillf Danton <dhillf@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      36e4f20a
    • Sachin Kamat's avatar
    • Michel Lespinasse's avatar
      mm: replace vma prio_tree with an interval tree · 6b2dbba8
      Michel Lespinasse authored
      
      
      Implement an interval tree as a replacement for the VMA prio_tree.  The
      algorithms are similar to lib/interval_tree.c; however that code can't be
      directly reused as the interval endpoints are not explicitly stored in the
      VMA.  So instead, the common algorithm is moved into a template and the
      details (node type, how to get interval endpoints from the node, etc) are
      filled in using the C preprocessor.
      
      Once the interval tree functions are available, using them as a
      replacement to the VMA prio tree is a relatively simple, mechanical job.
      
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6b2dbba8
    • Will Deacon's avatar
      mm: hugetlb: add arch hook for clearing page flags before entering pool · 5d3a551c
      Will Deacon authored
      
      
      The core page allocator ensures that page flags are zeroed when freeing
      pages via free_pages_check.  A number of architectures (ARM, PPC, MIPS)
      rely on this property to treat new pages as dirty with respect to the data
      cache and perform the appropriate flushing before mapping the pages into
      userspace.
      
      This can lead to cache synchronisation problems when using hugepages,
      since the allocator keeps its own pool of pages above the usual page
      allocator and does not reset the page flags when freeing a page into the
      pool.
      
      This patch adds a new architecture hook, arch_clear_hugepage_flags, so
      that architectures which rely on the page flags being in a particular
      state for fresh allocations can adjust the flags accordingly when a page
      is freed into the pool.
      
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5d3a551c
  15. 01 Aug, 2012 12 commits
    • Mel Gorman's avatar
      mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables · d833352a
      Mel Gorman authored
      If a process creates a large hugetlbfs mapping that is eligible for page
      table sharing and forks heavily with children some of whom fault and
      others which destroy the mapping then it is possible for page tables to
      get corrupted.  Some teardowns of the mapping encounter a "bad pmd" and
      output a message to the kernel log.  The final teardown will trigger a
      BUG_ON in mm/filemap.c.
      
      This was reproduced in 3.4 but is known to have existed for a long time
      and goes back at least as far as 2.6.37.  It was probably was introduced
      in 2.6.20 by [39dde65c
      
      : shared page table for hugetlb page].  The messages
      look like this;
      
      [  ..........] Lots of bad pmd messages followed by this
      [  127.164256] mm/memory.c:391: bad pmd ffff880412e04fe8(80000003de4000e7).
      [  127.164257] mm/memory.c:391: bad pmd ffff880412e04ff0(80000003de6000e7).
      [  127.164258] mm/memory.c:391: bad pmd ffff880412e04ff8(80000003de0000e7).
      [  127.186778] ------------[ cut here ]------------
      [  127.186781] kernel BUG at mm/filemap.c:134!
      [  127.186782] invalid opcode: 0000 [#1] SMP
      [  127.186783] CPU 7
      [  127.186784] Modules linked in: af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801 r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos usb_common button i2c_core intel_agp video intel_gtt fan processor thermal thermal_sys hwmon ata_generic pata_atiixp libata scsi_mod
      [  127.186801]
      [  127.186802] Pid: 9017, comm: hugetlbfs-test Not tainted 3.4.0-autobuild #53 Dell Inc. OptiPlex 990/06D7TR
      [  127.186804] RIP: 0010:[<ffffffff810ed6ce>]  [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
      [  127.186809] RSP: 0000:ffff8804144b5c08  EFLAGS: 00010002
      [  127.186810] RAX: 0000000000000001 RBX: ffffea000a5c9000 RCX: 00000000ffffffc0
      [  127.186811] RDX: 0000000000000000 RSI: 0000000000000009 RDI: ffff88042dfdad00
      [  127.186812] RBP: ffff8804144b5c18 R08: 0000000000000009 R09: 0000000000000003
      [  127.186813] R10: 0000000000000000 R11: 000000000000002d R12: ffff880412ff83d8
      [  127.186814] R13: ffff880412ff83d8 R14: 0000000000000000 R15: ffff880412ff83d8
      [  127.186815] FS:  00007fe18ed2c700(0000) GS:ffff88042dce0000(0000) knlGS:0000000000000000
      [  127.186816] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [  127.186817] CR2: 00007fe340000503 CR3: 0000000417a14000 CR4: 00000000000407e0
      [  127.186818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  127.186819] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [  127.186820] Process hugetlbfs-test (pid: 9017, threadinfo ffff8804144b4000, task ffff880417f803c0)
      [  127.186821] Stack:
      [  127.186822]  ffffea000a5c9000 0000000000000000 ffff8804144b5c48 ffffffff810ed83b
      [  127.186824]  ffff8804144b5c48 000000000000138a 0000000000001387 ffff8804144b5c98
      [  127.186825]  ffff8804144b5d48 ffffffff811bc925 ffff8804144b5cb8 0000000000000000
      [  127.186827] Call Trace:
      [  127.186829]  [<ffffffff810ed83b>] delete_from_page_cache+0x3b/0x80
      [  127.186832]  [<ffffffff811bc925>] truncate_hugepages+0x115/0x220
      [  127.186834]  [<ffffffff811bca43>] hugetlbfs_evict_inode+0x13/0x30
      [  127.186837]  [<ffffffff811655c7>] evict+0xa7/0x1b0
      [  127.186839]  [<ffffffff811657a3>] iput_final+0xd3/0x1f0
      [  127.186840]  [<ffffffff811658f9>] iput+0x39/0x50
      [  127.186842]  [<ffffffff81162708>] d_kill+0xf8/0x130
      [  127.186843]  [<ffffffff81162812>] dput+0xd2/0x1a0
      [  127.186845]  [<ffffffff8114e2d0>] __fput+0x170/0x230
      [  127.186848]  [<ffffffff81236e0e>] ? rb_erase+0xce/0x150
      [  127.186849]  [<ffffffff8114e3ad>] fput+0x1d/0x30
      [  127.186851]  [<ffffffff81117db7>] remove_vma+0x37/0x80
      [  127.186853]  [<ffffffff81119182>] do_munmap+0x2d2/0x360
      [  127.186855]  [<ffffffff811cc639>] sys_shmdt+0xc9/0x170
      [  127.186857]  [<ffffffff81410a39>] system_call_fastpath+0x16/0x1b
      [  127.186858] Code: 0f 1f 44 00 00 48 8b 43 08 48 8b 00 48 8b 40 28 8b b0 40 03 00 00 85 f6 0f 88 df fe ff ff 48 89 df e8 e7 cb 05 00 e9 d2 fe ff ff <0f> 0b 55 83 e2 fd 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 65 e0
      [  127.186868] RIP  [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
      [  127.186870]  RSP <ffff8804144b5c08>
      [  127.186871] ---[ end trace 7cbac5d1db69f426 ]---
      
      The bug is a race and not always easy to reproduce.  To reproduce it I was
      doing the following on a single socket I7-based machine with 16G of RAM.
      
      $ hugeadm --pool-pages-max DEFAULT:13G
      $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmmax
      $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmall
      $ for i in `seq 1 9000`; do ./hugetlbfs-test; done
      
      On my particular machine, it usually triggers within 10 minutes but
      enabling debug options can change the timing such that it never hits.
      Once the bug is triggered, the machine is in trouble and needs to be
      rebooted.  The machine will respond but processes accessing proc like "ps
      aux" will hang due to the BUG_ON.  shutdown will also hang and needs a
      hard reset or a sysrq-b.
      
      The basic problem is a race between page table sharing and teardown.  For
      the most part page table sharing depends on i_mmap_mutex.  In some cases,
      it is also taking the mm->page_table_lock for the PTE updates but with
      shared page tables, it is the i_mmap_mutex that is more important.
      
      Unfortunately it appears to be also insufficient. Consider the following
      situation
      
      Process A					Process B
      ---------					---------
      hugetlb_fault					shmdt
        						LockWrite(mmap_sem)
          						  do_munmap
      						    unmap_region
      						      unmap_vmas
      						        unmap_single_vma
      						          unmap_hugepage_range
            						            Lock(i_mmap_mutex)
      							    Lock(mm->page_table_lock)
      							    huge_pmd_unshare/unmap tables <--- (1)
      							    Unlock(mm->page_table_lock)
            						            Unlock(i_mmap_mutex)
        huge_pte_alloc				      ...
          Lock(i_mmap_mutex)				      ...
          vma_prio_walk, find svma, spte		      ...
          Lock(mm->page_table_lock)			      ...
          share spte					      ...
          Unlock(mm->page_table_lock)			      ...
          Unlock(i_mmap_mutex)			      ...
        hugetlb_no_page									  <--- (2)
      						      free_pgtables
      						        unlink_file_vma
      							hugetlb_free_pgd_range
      						    remove_vma_list
      
      In this scenario, it is possible for Process A to share page tables with
      Process B that is trying to tear them down.  The i_mmap_mutex on its own
      does not prevent Process A walking Process B's page tables.  At (1) above,
      the page tables are not shared yet so it unmaps the PMDs.  Process A sets
      up page table sharing and at (2) faults a new entry.  Process B then trips
      up on it in free_pgtables.
      
      This patch fixes the problem by adding a new function
      __unmap_hugepage_range_final that is only called when the VMA is about to
      be destroyed.  This function clears VM_MAYSHARE during
      unmap_hugepage_range() under the i_mmap_mutex.  This makes the VMA
      ineligible for sharing and avoids the race.  Superficially this looks like
      it would then be vunerable to truncate and madvise issues but hugetlbfs
      has its own truncate handlers so does not use unmap_mapping_range() and
      does not support madvise(DONTNEED).
      
      This should be treated as a -stable candidate if it is merged.
      
      Test program is as follows. The test case was mostly written by Michal
      Hocko with a few minor changes to reproduce this bug.
      
      ==== CUT HERE ====
      
      static size_t huge_page_size = (2UL << 20);
      static size_t nr_huge_page_A = 512;
      static size_t nr_huge_page_B = 5632;
      
      unsigned int get_random(unsigned int max)
      {
      	struct timeval tv;
      
      	gettimeofday(&tv, NULL);
      	srandom(tv.tv_usec);
      	return random() % max;
      }
      
      static void play(void *addr, size_t size)
      {
      	unsigned char *start = addr,
      		      *end = start + size,
      		      *a;
      	start += get_random(size/2);
      
      	/* we could itterate on huge pages but let's give it more time. */
      	for (a = start; a < end; a += 4096)
      		*a = 0;
      }
      
      int main(int argc, char **argv)
      {
      	key_t key = IPC_PRIVATE;
      	size_t sizeA = nr_huge_page_A * huge_page_size;
      	size_t sizeB = nr_huge_page_B * huge_page_size;
      	int shmidA, shmidB;
      	void *addrA = NULL, *addrB = NULL;
      	int nr_children = 300, n = 0;
      
      	if ((shmidA = shmget(key, sizeA, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
      		perror("shmget:");
      		return 1;
      	}
      
      	if ((addrA = shmat(shmidA, addrA, SHM_R|SHM_W)) == (void *)-1UL) {
      		perror("shmat");
      		return 1;
      	}
      	if ((shmidB = shmget(key, sizeB, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
      		perror("shmget:");
      		return 1;
      	}
      
      	if ((addrB = shmat(shmidB, addrB, SHM_R|SHM_W)) == (void *)-1UL) {
      		perror("shmat");
      		return 1;
      	}
      
      fork_child:
      	switch(fork()) {
      		case 0:
      			switch (n%3) {
      			case 0:
      				play(addrA, sizeA);
      				break;
      			case 1:
      				play(addrB, sizeB);
      				break;
      			case 2:
      				break;
      			}
      			break;
      		case -1:
      			perror("fork:");
      			break;
      		default:
      			if (++n < nr_children)
      				goto fork_child;
      			play(addrA, sizeA);
      			break;
      	}
      	shmdt(addrA);
      	shmdt(addrB);
      	do {
      		wait(NULL);
      	} while (--n > 0);
      	shmctl(shmidA, IPC_RMID, NULL);
      	shmctl(shmidB, IPC_RMID, NULL);
      	return 0;
      }
      
      [akpm@linux-foundation.org: name the declaration's args, fix CONFIG_HUGETLBFS=n build]
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d833352a
    • Aneesh Kumar K.V's avatar
      hugetlb/cgroup: assign the page hugetlb cgroup when we move the page to active list. · 94ae8ba7
      Aneesh Kumar K.V authored
      
      
      A page's hugetlb cgroup assignment and movement to the active list should
      occur with hugetlb_lock held.  Otherwise when we remove the hugetlb cgroup
      we will iterate the active list and find pages with NULL hugetlb cgroup
      values.
      
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      94ae8ba7
    • Aneesh Kumar K.V's avatar
      hugetlb: move all the in use pages to active list · 79dbb236
      Aneesh Kumar K.V authored
      
      
      When we fail to allocate pages from the reserve pool, hugetlb tries to
      allocate huge pages using alloc_buddy_huge_page.  Add these to the active
      list.  We also need to add the huge page we allocate when we soft offline
      the oldpage to active list.
      
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      79dbb236
    • Aneesh Kumar K.V's avatar
      hugetlb/cgroup: add hugetlb cgroup control files · abb8206c
      Aneesh Kumar K.V authored
      
      
      Add the control files for hugetlb controller
      
      [akpm@linux-foundation.org: s/CONFIG_CGROUP_HUGETLB_RES_CTLR/CONFIG_MEMCG_HUGETLB/g]
      [akpm@linux-foundation.org: s/CONFIG_MEMCG_HUGETLB/CONFIG_CGROUP_HUGETLB/]
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      abb8206c
    • Aneesh Kumar K.V's avatar
      hugetlb/cgroup: add charge/uncharge routines for hugetlb cgroup · 6d76dcf4
      Aneesh Kumar K.V authored
      
      
      Add the charge and uncharge routines for hugetlb cgroup.  We do cgroup
      charging in page alloc and uncharge in compound page destructor.
      Assigning page's hugetlb cgroup is protected by hugetlb_lock.
      
      [liwp@linux.vnet.ibm.com: add huge_page_order check to avoid incorrect uncharge]
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: default avatarWanpeng Li <liwp.linux@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6d76dcf4
    • Aneesh Kumar K.V's avatar
      hugetlb/cgroup: add the cgroup pointer to page lru · 9dd540e2
      Aneesh Kumar K.V authored
      
      
      Add the hugetlb cgroup pointer to 3rd page lru.next.  This limit the usage
      to hugetlb cgroup to only hugepages with 3 or more normal pages.  I guess
      that is an acceptable limitation.
      
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9dd540e2
    • Aneesh Kumar K.V's avatar
      hugetlb: make some static variables global · c3f38a38
      Aneesh Kumar K.V authored
      
      
      We will use them later in hugetlb_cgroup.c
      
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c3f38a38
    • Aneesh Kumar K.V's avatar
      hugetlb: add a list for tracking in-use HugeTLB pages · 0edaecfa
      Aneesh Kumar K.V authored
      
      
      hugepage_activelist will be used to track currently used HugeTLB pages.
      We need to find the in-use HugeTLB pages to support HugeTLB cgroup removal.
      On cgroup removal we update the page's HugeTLB cgroup to point to parent
      cgroup.
      
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0edaecfa
    • Aneesh Kumar K.V's avatar
      hugetlb: use mmu_gather instead of a temporary linked list for accumulating pages · 24669e58
      Aneesh Kumar K.V authored
      
      
      Use a mmu_gather instead of a temporary linked list for accumulating pages
      when we unmap a hugepage range
      
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      24669e58
    • Aneesh Kumar K.V's avatar
      hugetlb: add an inline helper for finding hstate index · 972dc4de
      Aneesh Kumar K.V authored
      
      
      Add an inline helper and use it in the code.
      
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      972dc4de
    • Aneesh Kumar K.V's avatar
      hugetlb: don't use ERR_PTR with VM_FAULT* values · 76dcee75
      Aneesh Kumar K.V authored
      
      
      The current use of VM_FAULT_* codes with ERR_PTR requires us to ensure
      VM_FAULT_* values will not exceed MAX_ERRNO value.  Decouple the
      VM_FAULT_* values from MAX_ERRNO.
      
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: default avatarHillf Danton <dhillf@gmail.com>
      Acked-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      76dcee75
    • Aneesh Kumar K.V's avatar
      hugetlb: rename max_hstate to hugetlb_max_hstate · 47d38344
      Aneesh Kumar K.V authored
      
      
      This patchset implements a cgroup resource controller for HugeTLB pages.
      The controller allows to limit the HugeTLB usage per control group and
      enforces the controller limit during page fault.  Since HugeTLB doesn't
      support page reclaim, enforcing the limit at page fault time implies that,
      the application will get SIGBUS signal if it tries to access HugeTLB pages
      beyond its limit.  This requires the application to know beforehand how
      much HugeTLB pages it would require for its use.
      
      The goal is to control how many HugeTLB pages a group of task can
      allocate.  It can be looked at as an extension of the existing quota
      interface which limits the number of HugeTLB pages per hugetlbfs
      superblock.  HPC job scheduler requires jobs to specify their resource
      requirements in the job file.  Once their requirements can be met, job
      schedulers like (SLURM) will schedule the job.  We need to make sure that
      the jobs won't consume more resources than requested.  If they do we
      should either error out or kill the application.
      
      This patch:
      
      Rename max_hstate to hugetlb_max_hstate.  We will be using this from other
      subsystems like hugetlb controller in later patches.
      
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: default avatarHillf Danton <dhillf@gmail.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      47d38344
  16. 30 May, 2012 1 commit
    • Dave Hansen's avatar
      mm: fix vma_resv_map() NULL pointer · 4523e145
      Dave Hansen authored
      hugetlb_reserve_pages() can be used for either normal file-backed
      hugetlbfs mappings, or MAP_HUGETLB.  In the MAP_HUGETLB, semi-anonymous
      mode, there is not a VMA around.  The new call to resv_map_put() assumed
      that there was, and resulted in a NULL pointer dereference:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
        IP: vma_resv_map+0x9/0x30
        PGD 141453067 PUD 1421e1067 PMD 0
        Oops: 0000 [#1] PREEMPT SMP
        ...
        Pid: 14006, comm: trinity-child6 Not tainted 3.4.0+ #36
        RIP: vma_resv_map+0x9/0x30
        ...
        Process trinity-child6 (pid: 14006, threadinfo ffff8801414e0000, task ffff8801414f26b0)
        Call Trace:
          resv_map_put+0xe/0x40
          hugetlb_reserve_pages+0xa6/0x1d0
          hugetlb_file_setup+0x102/0x2c0
          newseg+0x115/0x360
          ipcget+0x1ce/0x310
          sys_shmget+0x5a/0x60
          system_call_fastpath+0x16/0x1b
      
      This was reported by Dave Jones, but was reproducible with the
      libhugetlbfs test cases, so shame on me for not running them in the
      first place.
      
      With this, the oops is gone, and the output of libhugetlbfs's
      run_tests.py is identical to plain 3.4 again.
      
      [ Marked for stable, since this was introduced by commit c50ac050
      
      
        ("hugetlb: fix resv_map leak in error path") which was also marked for
        stable ]
      
      Reported-by: default avatarDave Jones <davej@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: <stable@vger.kernel.org>        [2.6.32+]
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4523e145
  17. 29 May, 2012 2 commits
  18. 25 May, 2012 1 commit
  19. 10 May, 2012 1 commit