1. 06 Jan, 2009 1 commit
  2. 23 Oct, 2008 1 commit
  3. 27 Jul, 2008 1 commit
  4. 24 Jul, 2008 8 commits
    • Jon Tollefson's avatar
      hugetlb: allow arch overridden hugepage allocation · 53ba51d2
      Jon Tollefson authored
      
      
      Allow alloc_bootmem_huge_page() to be overridden by architectures that
      can't always use bootmem.  This requires huge_boot_pages to be available
      for use by this function.
      
      This is required for powerpc 16G pages, which have to be reserved prior to
      boot-time.  The location of these pages are indicated in the device tree.
      Acked-by: default avatarAdam Litke <agl@us.ibm.com>
      Signed-off-by: default avatarJon Tollefson <kniht@linux.vnet.ibm.com>
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      53ba51d2
    • Andi Kleen's avatar
      hugetlb: introduce pud_huge · ceb86879
      Andi Kleen authored
      
      
      Straight forward extensions for huge pages located in the PUD instead of
      PMDs.
      Signed-off-by: default avatarAndi Kleen <ak@suse.de>
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ceb86879
    • Nishanth Aravamudan's avatar
      hugetlb: new sysfs interface · a3437870
      Nishanth Aravamudan authored
      
      
      Provide new hugepages user APIs that are more suited to multiple hstates
      in sysfs.  There is a new directory, /sys/kernel/hugepages.  Underneath
      that directory there will be a directory per-supported hugepage size,
      e.g.:
      
      /sys/kernel/hugepages/hugepages-64kB
      /sys/kernel/hugepages/hugepages-16384kB
      /sys/kernel/hugepages/hugepages-16777216kB
      
      corresponding to 64k, 16m and 16g respectively.  Within each
      hugepages-size directory there are a number of files, corresponding to the
      tracked counters in the hstate, e.g.:
      
      /sys/kernel/hugepages/hugepages-64/nr_hugepages
      /sys/kernel/hugepages/hugepages-64/nr_overcommit_hugepages
      /sys/kernel/hugepages/hugepages-64/free_hugepages
      /sys/kernel/hugepages/hugepages-64/resv_hugepages
      /sys/kernel/hugepages/hugepages-64/surplus_hugepages
      
      Of these files, the first two are read-write and the latter three are
      read-only.  The size of the hugepage being manipulated is trivially
      deducible from the enclosing directory and is always expressed in kB (to
      match meminfo).
      
      [dave@linux.vnet.ibm.com: fix build]
      [nacc@us.ibm.com: hugetlb: hang off of /sys/kernel/mm rather than /sys/kernel]
      [nacc@us.ibm.com: hugetlb: remove CONFIG_SYSFS dependency]
      Acked-by: default avatarGreg Kroah-Hartman <gregkh@suse.de>
      Signed-off-by: default avatarNishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: default avatarNishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a3437870
    • Andi Kleen's avatar
      hugetlbfs: per mount huge page sizes · a137e1cc
      Andi Kleen authored
      
      
      Add the ability to configure the hugetlb hstate used on a per mount basis.
      
      - Add a new pagesize= option to the hugetlbfs mount that allows setting
        the page size
      - This option causes the mount code to find the hstate corresponding to the
        specified size, and sets up a pointer to the hstate in the mount's
        superblock.
      - Change the hstate accessors to use this information rather than the
        global_hstate they were using (requires a slight change in mm/memory.c
        so we don't NULL deref in the error-unmap path -- see comments).
      
      [np: take hstate out of hugetlbfs inode and vma->vm_private_data]
      Acked-by: default avatarAdam Litke <agl@us.ibm.com>
      Acked-by: default avatarNishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: default avatarAndi Kleen <ak@suse.de>
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a137e1cc
    • Andi Kleen's avatar
      hugetlb: multiple hstates for multiple page sizes · e5ff2159
      Andi Kleen authored
      
      
      Add basic support for more than one hstate in hugetlbfs.  This is the key
      to supporting multiple hugetlbfs page sizes at once.
      
      - Rather than a single hstate, we now have an array, with an iterator
      - default_hstate continues to be the struct hstate which we use by default
      - Add functions for architectures to register new hstates
      
      [akpm@linux-foundation.org: coding-style fixes]
      Acked-by: default avatarAdam Litke <agl@us.ibm.com>
      Acked-by: default avatarNishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: default avatarAndi Kleen <ak@suse.de>
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e5ff2159
    • Andi Kleen's avatar
      hugetlb: modular state for hugetlb page size · a5516438
      Andi Kleen authored
      
      
      The goal of this patchset is to support multiple hugetlb page sizes.  This
      is achieved by introducing a new struct hstate structure, which
      encapsulates the important hugetlb state and constants (eg.  huge page
      size, number of huge pages currently allocated, etc).
      
      The hstate structure is then passed around the code which requires these
      fields, they will do the right thing regardless of the exact hstate they
      are operating on.
      
      This patch adds the hstate structure, with a single global instance of it
      (default_hstate), and does the basic work of converting hugetlb to use the
      hstate.
      
      Future patches will add more hstate structures to allow for different
      hugetlbfs mounts to have different page sizes.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Acked-by: default avatarAdam Litke <agl@us.ibm.com>
      Acked-by: default avatarNishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: default avatarAndi Kleen <ak@suse.de>
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a5516438
    • Mel Gorman's avatar
      hugetlb: guarantee that COW faults for a process that called mmap(MAP_PRIVATE)... · 04f2cbe3
      Mel Gorman authored
      
      hugetlb: guarantee that COW faults for a process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed
      
      After patch 2 in this series, a process that successfully calls mmap() for
      a MAP_PRIVATE mapping will be guaranteed to successfully fault until a
      process calls fork().  At that point, the next write fault from the parent
      could fail due to COW if the child still has a reference.
      
      We only reserve pages for the parent but a copy must be made to avoid
      leaking data from the parent to the child after fork().  Reserves could be
      taken for both parent and child at fork time to guarantee faults but if
      the mapping is large it is highly likely we will not have sufficient pages
      for the reservation, and it is common to fork only to exec() immediatly
      after.  A failure here would be very undesirable.
      
      Note that the current behaviour of mainline with MAP_PRIVATE pages is
      pretty bad.  The following situation is allowed to occur today.
      
      1. Process calls mmap(MAP_PRIVATE)
      2. Process calls mlock() to fault all pages and makes sure it succeeds
      3. Process forks()
      4. Process writes to MAP_PRIVATE mapping while child still exists
      5. If the COW fails at this point, the process gets SIGKILLed even though it
         had taken care to ensure the pages existed
      
      This patch improves the situation by guaranteeing the reliability of the
      process that successfully calls mmap().  When the parent performs COW, it
      will try to satisfy the allocation without using reserves.  If that fails
      the parent will steal the page leaving any children without a page.
      Faults from the child after that point will result in failure.  If the
      child COW happens first, an attempt will be made to allocate the page
      without reserves and the child will get SIGKILLed on failure.
      
      To summarise the new behaviour:
      
      1. If the original mapper performs COW on a private mapping with multiple
         references, it will attempt to allocate a hugepage from the pool or
         the buddy allocator without using the existing reserves. On fail, VMAs
         mapping the same area are traversed and the page being COW'd is unmapped
         where found. It will then steal the original page as the last mapper in
         the normal way.
      
      2. The VMAs the pages were unmapped from are flagged to note that pages
         with data no longer exist. Future no-page faults on those VMAs will
         terminate the process as otherwise it would appear that data was corrupted.
         A warning is printed to the console that this situation occured.
      
      2. If the child performs COW first, it will attempt to satisfy the COW
         from the pool if there are enough pages or via the buddy allocator if
         overcommit is allowed and the buddy allocator can satisfy the request. If
         it fails, the child will be killed.
      
      If the pool is large enough, existing applications will not notice that
      the reserves were a factor.  Existing applications depending on the
      no-reserves been set are unlikely to exist as for much of the history of
      hugetlbfs, pages were prefaulted at mmap(), allocating the pages at that
      point or failing the mmap().
      
      [npiggin@suse.de: fix CONFIG_HUGETLB=n build]
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Acked-by: default avatarAdam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      04f2cbe3
    • Mel Gorman's avatar
      hugetlb: reserve huge pages for reliable MAP_PRIVATE hugetlbfs mappings until fork() · a1e78772
      Mel Gorman authored
      
      
      This patch reserves huge pages at mmap() time for MAP_PRIVATE mappings in
      a similar manner to the reservations taken for MAP_SHARED mappings.  The
      reserve count is accounted both globally and on a per-VMA basis for
      private mappings.  This guarantees that a process that successfully calls
      mmap() will successfully fault all pages in the future unless fork() is
      called.
      
      The characteristics of private mappings of hugetlbfs files behaviour after
      this patch are;
      
      1. The process calling mmap() is guaranteed to succeed all future faults until
         it forks().
      2. On fork(), the parent may die due to SIGKILL on writes to the private
         mapping if enough pages are not available for the COW. For reasonably
         reliable behaviour in the face of a small huge page pool, children of
         hugepage-aware processes should not reference the mappings; such as
         might occur when fork()ing to exec().
      3. On fork(), the child VMAs inherit no reserves. Reads on pages already
         faulted by the parent will succeed. Successful writes will depend on enough
         huge pages being free in the pool.
      4. Quotas of the hugetlbfs mount are checked at reserve time for the mapper
         and at fault time otherwise.
      
      Before this patch, all reads or writes in the child potentially needs page
      allocations that can later lead to the death of the parent.  This applies
      to reads and writes of uninstantiated pages as well as COW.  After the
      patch it is only a write to an instantiated page that causes problems.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Acked-by: default avatarAdam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a1e78772
  5. 28 Apr, 2008 1 commit
    • Gerald Schaefer's avatar
      hugetlbfs: architecture header cleanup · 6d779079
      Gerald Schaefer authored
      
      
      This patch moves all architecture functions for hugetlb to architecture header
      files (include/asm-foo/hugetlb.h) and converts all macros to inline functions.
       It also removes (!) ARCH_HAS_HUGEPAGE_ONLY_RANGE,
      ARCH_HAS_HUGETLB_FREE_PGD_RANGE, ARCH_HAS_PREPARE_HUGEPAGE_RANGE,
      ARCH_HAS_SETCLEAR_HUGE_PTE and ARCH_HAS_HUGETLB_PREFAULT_HOOK.
      
      Getting rid of the ARCH_HAS_xxx #ifdef and macro fugliness should increase
      readability and maintainability, at the price of some code duplication.  An
      asm-generic common part would have reduced the loc, but we would end up with
      new ARCH_HAS_xxx defines eventually.
      Acked-by: default avatarMartin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: default avatarGerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6d779079
  6. 14 Feb, 2008 1 commit
  7. 08 Feb, 2008 1 commit
  8. 18 Dec, 2007 2 commits
    • Nishanth Aravamudan's avatar
      Revert "hugetlb: Add hugetlb_dynamic_pool sysctl" · 368d2c63
      Nishanth Aravamudan authored
      This reverts commit 54f9f80d
      
       ("hugetlb:
      Add hugetlb_dynamic_pool sysctl")
      
      Given the new sysctl nr_overcommit_hugepages, the boolean dynamic pool
      sysctl is not needed, as its semantics can be expressed by 0 in the
      overcommit sysctl (no dynamic pool) and non-0 in the overcommit sysctl
      (pool enabled).
      
      (Needed in 2.6.24 since it reverts a post-2.6.23 userspace-visible change)
      Signed-off-by: default avatarNishanth Aravamudan <nacc@us.ibm.com>
      Acked-by: default avatarAdam Litke <agl@us.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      368d2c63
    • Nishanth Aravamudan's avatar
      hugetlb: introduce nr_overcommit_hugepages sysctl · d1c3fb1f
      Nishanth Aravamudan authored
      
      
      hugetlb: introduce nr_overcommit_hugepages sysctl
      
      While examining the code to support /proc/sys/vm/hugetlb_dynamic_pool, I
      became convinced that having a boolean sysctl was insufficient:
      
      1) To support per-node control of hugepages, I have previously submitted
      patches to add a sysfs attribute related to nr_hugepages. However, with
      a boolean global value and per-mount quota enforcement constraining the
      dynamic pool, adding corresponding control of the dynamic pool on a
      per-node basis seems inconsistent to me.
      
      2) Administration of the hugetlb dynamic pool with multiple hugetlbfs
      mount points is, arguably, more arduous than it needs to be. Each quota
      would need to be set separately, and the sum would need to be monitored.
      
      To ease the administration, and to help make the way for per-node
      control of the static & dynamic hugepage pool, I added a separate
      sysctl, nr_overcommit_hugepages. This value serves as a high watermark
      for the overall hugepage pool, while nr_hugepages serves as a low
      watermark. The boolean sysctl can then be removed, as the condition
      
      	nr_overcommit_hugepages > 0
      
      indicates the same administrative setting as
      
      	hugetlb_dynamic_pool == 1
      
      Quotas still serve as local enforcement of the size of the pool on a
      per-mount basis.
      
      A few caveats:
      
      1) There is a race whereby the global surplus huge page counter is
      incremented before a hugepage has allocated. Another process could then
      try grow the pool, and fail to convert a surplus huge page to a normal
      huge page and instead allocate a fresh huge page. I believe this is
      benign, as no memory is leaked (the actual pages are still tracked
      correctly) and the counters won't go out of sync.
      
      2) Shrinking the static pool while a surplus is in effect will allow the
      number of surplus huge pages to exceed the overcommit value. As long as
      this condition holds, however, no more surplus huge pages will be
      allowed on the system until one of the two sysctls are increased
      sufficiently, or the surplus huge pages go out of use and are freed.
      
      Successfully tested on x86_64 with the current libhugetlbfs snapshot,
      modified to use the new sysctl.
      Signed-off-by: default avatarNishanth Aravamudan <nacc@us.ibm.com>
      Acked-by: default avatarAdam Litke <agl@us.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d1c3fb1f
  9. 15 Nov, 2007 3 commits
  10. 16 Oct, 2007 1 commit
  11. 31 Aug, 2007 1 commit
    • David Gibson's avatar
      hugepage: fix broken check for offset alignment in hugepage mappings · dec4ad86
      David Gibson authored
      For hugepage mappings, the file offset, like the address and size, needs to
      be aligned to the size of a hugepage.
      
      In commit 68589bc3, the check for this was
      moved into prepare_hugepage_range() along with the address and size checks.
       But since BenH's rework of the get_unmapped_area() paths leading up to
      commit 4b1d8929
      
      , prepare_hugepage_range()
      is only called for MAP_FIXED mappings, not for other mappings.  This means
      we're no longer ever checking for an aligned offset - I've confirmed that
      mmap() will (apparently) succeed with a misaligned offset on both powerpc
      and i386 at least.
      
      This patch restores the check, removing it from prepare_hugepage_range()
      and putting it back into hugetlbfs_file_mmap().  I'm putting it there,
      rather than in the get_unmapped_area() path so it only needs to go in one
      place, than separately in the half-dozen or so arch-specific
      implementations of hugetlb_get_unmapped_area().
      Signed-off-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dec4ad86
  12. 30 Jul, 2007 1 commit
    • Alexey Dobriyan's avatar
      Remove fs.h from mm.h · 4e950f6f
      Alexey Dobriyan authored
      
      
      Remove fs.h from mm.h. For this,
       1) Uninline vma_wants_writenotify(). It's pretty huge anyway.
       2) Add back fs.h or less bloated headers (err.h) to files that need it.
      
      As result, on x86_64 allyesconfig, fs.h dependencies cut down from 3929 files
      rebuilt down to 3444 (-12.3%).
      
      Cross-compile tested without regressions on my two usual configs and (sigh):
      
      alpha              arm-mx1ads        mips-bigsur          powerpc-ebony
      alpha-allnoconfig  arm-neponset      mips-capcella        powerpc-g5
      alpha-defconfig    arm-netwinder     mips-cobalt          powerpc-holly
      alpha-up           arm-netx          mips-db1000          powerpc-iseries
      arm                arm-ns9xxx        mips-db1100          powerpc-linkstation
      arm-assabet        arm-omap_h2_1610  mips-db1200          powerpc-lite5200
      arm-at91rm9200dk   arm-onearm        mips-db1500          powerpc-maple
      arm-at91rm9200ek   arm-picotux200    mips-db1550          powerpc-mpc7448_hpc2
      arm-at91sam9260ek  arm-pleb          mips-ddb5477         powerpc-mpc8272_ads
      arm-at91sam9261ek  arm-pnx4008       mips-decstation      powerpc-mpc8313_rdb
      arm-at91sam9263ek  arm-pxa255-idp    mips-e55             powerpc-mpc832x_mds
      arm-at91sam9rlek   arm-realview      mips-emma2rh         powerpc-mpc832x_rdb
      arm-ateb9200       arm-realview-smp  mips-excite          powerpc-mpc834x_itx
      arm-badge4         arm-rpc           mips-fulong          powerpc-mpc834x_itxgp
      arm-carmeva        arm-s3c2410       mips-ip22            powerpc-mpc834x_mds
      arm-cerfcube       arm-shannon       mips-ip27            powerpc-mpc836x_mds
      arm-clps7500       arm-shark         mips-ip32            powerpc-mpc8540_ads
      arm-collie         arm-simpad        mips-jazz            powerpc-mpc8544_ds
      arm-corgi          arm-spitz         mips-jmr3927         powerpc-mpc8560_ads
      arm-csb337         arm-trizeps4      mips-malta           powerpc-mpc8568mds
      arm-csb637         arm-versatile     mips-mipssim         powerpc-mpc85xx_cds
      arm-ebsa110        i386              mips-mpc30x          powerpc-mpc8641_hpcn
      arm-edb7211        i386-allnoconfig  mips-msp71xx         powerpc-mpc866_ads
      arm-em_x270        i386-defconfig    mips-ocelot          powerpc-mpc885_ads
      arm-ep93xx         i386-up           mips-pb1100          powerpc-pasemi
      arm-footbridge     ia64              mips-pb1500          powerpc-pmac32
      arm-fortunet       ia64-allnoconfig  mips-pb1550          powerpc-ppc64
      arm-h3600          ia64-bigsur       mips-pnx8550-jbs     powerpc-prpmc2800
      arm-h7201          ia64-defconfig    mips-pnx8550-stb810  powerpc-ps3
      arm-h7202          ia64-gensparse    mips-qemu            powerpc-pseries
      arm-hackkit        ia64-sim          mips-rbhma4200       powerpc-up
      arm-integrator     ia64-sn2          mips-rbhma4500       s390
      arm-iop13xx        ia64-tiger        mips-rm200           s390-allnoconfig
      arm-iop32x         ia64-up           mips-sb1250-swarm    s390-defconfig
      arm-iop33x         ia64-zx1          mips-sead            s390-up
      arm-ixp2000        m68k              mips-tb0219          sparc
      arm-ixp23xx        m68k-amiga        mips-tb0226          sparc-allnoconfig
      arm-ixp4xx         m68k-apollo       mips-tb0287          sparc-defconfig
      arm-jornada720     m68k-atari        mips-workpad         sparc-up
      arm-kafa           m68k-bvme6000     mips-wrppmc          sparc64
      arm-kb9202         m68k-hp300        mips-yosemite        sparc64-allnoconfig
      arm-ks8695         m68k-mac          parisc               sparc64-defconfig
      arm-lart           m68k-mvme147      parisc-allnoconfig   sparc64-up
      arm-lpd270         m68k-mvme16x      parisc-defconfig     um-x86_64
      arm-lpd7a400       m68k-q40          parisc-up            x86_64
      arm-lpd7a404       m68k-sun3         powerpc              x86_64-allnoconfig
      arm-lubbock        m68k-sun3x        powerpc-cell         x86_64-defconfig
      arm-lusl7200       mips              powerpc-celleb       x86_64-up
      arm-mainstone      mips-atlas        powerpc-chrp32
      Signed-off-by: default avatarAlexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4e950f6f
  13. 17 Jul, 2007 1 commit
    • Mel Gorman's avatar
      Allow huge page allocations to use GFP_HIGH_MOVABLE · 396faf03
      Mel Gorman authored
      
      
      Huge pages are not movable so are not allocated from ZONE_MOVABLE.  However,
      as ZONE_MOVABLE will always have pages that can be migrated or reclaimed, it
      can be used to satisfy hugepage allocations even when the system has been
      running a long time.  This allows an administrator to resize the hugepage pool
      at runtime depending on the size of ZONE_MOVABLE.
      
      This patch adds a new sysctl called hugepages_treat_as_movable.  When a
      non-zero value is written to it, future allocations for the huge page pool
      will use ZONE_MOVABLE.  Despite huge pages being non-movable, we do not
      introduce additional external fragmentation of note as huge pages are always
      the largest contiguous block we care about.
      
      [akpm@linux-foundation.org: various fixes]
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      396faf03
  14. 16 Jun, 2007 1 commit
    • Eric W. Biederman's avatar
      shm: fix the filename of hugetlb sysv shared memory · 9d66586f
      Eric W. Biederman authored
      
      
      Some user space tools need to identify SYSV shared memory when examining
      /proc/<pid>/maps.  To do so they look for a block device with major zero, a
      dentry named SYSV<sysv key>, and having the minor of the internal sysv
      shared memory kernel mount.
      
      To help these tools and to make it easier for people just browsing
      /proc/<pid>/maps this patch modifies hugetlb sysv shared memory to use the
      SYSV<key> dentry naming convention.
      
      User space tools will still have to be aware that hugetlb sysv shared
      memory lives on a different internal kernel mount and so has a different
      block device minor number from the rest of sysv shared memory.
      Signed-off-by: default avatarEric W. Biederman <ebiederm@xmission.com>
      Cc: "Serge E. Hallyn" <serge@hallyn.com>
      Cc: Albert Cahalan <acahalan@gmail.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9d66586f
  15. 07 May, 2007 1 commit
  16. 02 Mar, 2007 1 commit
    • Adam Litke's avatar
      [PATCH] Fix get_unmapped_area and fsync for hugetlb shm segments · 516dffdc
      Adam Litke authored
      
      
      This patch provides the following hugetlb-related fixes to the recent stacked
      shm files changes:
       - Update is_file_hugepages() so it will reconize hugetlb shm segments.
       - get_unmapped_area must be called with the nested file struct to handle
         the sfd->file->f_ops->get_unmapped_area == NULL case.
       - The fsync f_op must be wrapped since it is specified in the hugetlbfs
         f_ops.
      
      This is based on proposed fixes from Eric Biederman that were debugged and
      tested by me.  Without it, attempting to use hugetlb shared memory segments
      on powerpc (and likely ia64) will kill your box.
      Signed-off-by: default avatarAdam Litke <agl@us.ibm.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarWilliam Irwin <bill.irwin@oracle.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      516dffdc
  17. 07 Dec, 2006 1 commit
    • Chen, Kenneth W's avatar
      [PATCH] shared page table for hugetlb page · 39dde65c
      Chen, Kenneth W authored
      
      
      Following up with the work on shared page table done by Dave McCracken.  This
      set of patch target shared page table for hugetlb memory only.
      
      The shared page table is particular useful in the situation of large number of
      independent processes sharing large shared memory segments.  In the normal
      page case, the amount of memory saved from process' page table is quite
      significant.  For hugetlb, the saving on page table memory is not the primary
      objective (as hugetlb itself already cuts down page table overhead
      significantly), instead, the purpose of using shared page table on hugetlb is
      to allow faster TLB refill and smaller cache pollution upon TLB miss.
      
      With PT sharing, pte entries are shared among hundreds of processes, the cache
      consumption used by all the page table is smaller and in return, application
      gets much higher cache hit ratio.  One other effect is that cache hit ratio
      with hardware page walker hitting on pte in cache will be higher and this
      helps to reduce tlb miss latency.  These two effects contribute to higher
      application performance.
      Signed-off-by: default avatarKen Chen <kenneth.w.chen@intel.com>
      Acked-by: default avatarHugh Dickins <hugh@veritas.com>
      Cc: Dave McCracken <dmccr@us.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      39dde65c
  18. 14 Nov, 2006 1 commit
    • Hugh Dickins's avatar
      [PATCH] hugetlb: prepare_hugepage_range check offset too · 68589bc3
      Hugh Dickins authored
      
      
      (David:)
      
      If hugetlbfs_file_mmap() returns a failure to do_mmap_pgoff() - for example,
      because the given file offset is not hugepage aligned - then do_mmap_pgoff
      will go to the unmap_and_free_vma backout path.
      
      But at this stage the vma hasn't been marked as hugepage, and the backout path
      will call unmap_region() on it.  That will eventually call down to the
      non-hugepage version of unmap_page_range().  On ppc64, at least, that will
      cause serious problems if there are any existing hugepage pagetable entries in
      the vicinity - for example if there are any other hugepage mappings under the
      same PUD.  unmap_page_range() will trigger a bad_pud() on the hugepage pud
      entries.  I suspect this will also cause bad problems on ia64, though I don't
      have a machine to test it on.
      
      (Hugh:)
      
      prepare_hugepage_range() should check file offset alignment when it checks
      virtual address and length, to stop MAP_FIXED with a bad huge offset from
      unmapping before it fails further down.  PowerPC should apply the same
      prepare_hugepage_range alignment checks as ia64 and all the others do.
      
      Then none of the alignment checks in hugetlbfs_file_mmap are required (nor
      is the check for too small a mapping); but even so, move up setting of
      VM_HUGETLB and add a comment to warn of what David Gibson discovered - if
      hugetlbfs_file_mmap fails before setting it, do_mmap_pgoff's unmap_region
      when unwinding from error will go the non-huge way, which may cause bad
      behaviour on architectures (powerpc and ia64) which segregate their huge
      mappings into a separate region of the address space.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Acked-by: default avatarAdam Litke <agl@us.ibm.com>
      Acked-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      68589bc3
  19. 11 Oct, 2006 1 commit
  20. 23 Jun, 2006 1 commit
    • Chen, Kenneth W's avatar
      [PATCH] tightening hugetlb strict accounting · a43a8c39
      Chen, Kenneth W authored
      
      
      Current hugetlb strict accounting for shared mapping always assume mapping
      starts at zero file offset and reserves pages between zero and size of the
      file.  This assumption often reserves (or lock down) a lot more pages then
      necessary if application maps at none zero file offset.  libhugetlbfs is
      one example that requires proper reservation on shared mapping starts at
      none zero offset.
      
      This patch extends the reservation and hugetlb strict accounting to support
      any arbitrary pair of (offset, len), resulting a much more robust and
      accurate scheme.  More importantly, it won't lock down any hugetlb pages
      outside file mapping.
      Signed-off-by: default avatarKen Chen <kenneth.w.chen@intel.com>
      Acked-by: default avatarAdam Litke <agl@us.ibm.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      a43a8c39
  21. 28 Mar, 2006 1 commit
  22. 22 Mar, 2006 6 commits
    • David Gibson's avatar
      [PATCH] hugepage: is_aligned_hugepage_range() cleanup · 42b88bef
      David Gibson authored
      
      
      Quite a long time back, prepare_hugepage_range() replaced
      is_aligned_hugepage_range() as the callback from mm/mmap.c to arch code to
      verify if an address range is suitable for a hugepage mapping.
      is_aligned_hugepage_range() stuck around, but only to implement
      prepare_hugepage_range() on archs which didn't implement their own.
      
      Most archs (everything except ia64 and powerpc) used the same
      implementation of is_aligned_hugepage_range().  On powerpc, which
      implements its own prepare_hugepage_range(), the custom version was never
      used.
      
      In addition, "is_aligned_hugepage_range()" was a bad name, because it
      suggests it returns true iff the given range is a good hugepage range,
      whereas in fact it returns 0-or-error (so the sense is reversed).
      
      This patch cleans up by abolishing is_aligned_hugepage_range().  Instead
      prepare_hugepage_range() is defined directly.  Most archs use the default
      version, which simply checks the given region is aligned to the size of a
      hugepage.  ia64 and powerpc define custom versions.  The ia64 one simply
      checks that the range is in the correct address space region in addition to
      being suitably aligned.  The powerpc version (just as previously) checks
      for suitable addresses, and if necessary performs low-level MMU frobbing to
      set up new areas for use by hugepages.
      
      No libhugetlbfs testsuite regressions on ppc64 (POWER5 LPAR).
      Signed-off-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarZhang Yanmin <yanmin.zhang@intel.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      42b88bef
    • David Gibson's avatar
      [PATCH] hugepage: Move hugetlb_free_pgd_range() prototype to hugetlb.h · 3915bcf3
      David Gibson authored
      
      
      The optional hugepage callback, hugetlb_free_pgd_range() is presently
      implemented non-trivially only on ia64 (but I plan to add one for powerpc
      shortly).  It has its own prototype for the function in asm-ia64/pgtable.h.
       However, since the function is called from generic code, it make sense for
      its prototype to be in the generic hugetlb.h header file, as the protypes
      other arch callbacks already are (prepare_hugepage_range(),
      set_huge_pte_at(), etc.).  This patch makes it so.
      Signed-off-by: default avatarDavid Gibson <dwg@au1.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      3915bcf3
    • David Gibson's avatar
      [PATCH] hugepage: Fix hugepage logic in free_pgtables() · 9da61aef
      David Gibson authored
      
      
      free_pgtables() has special logic to call hugetlb_free_pgd_range() instead
      of the normal free_pgd_range() on hugepage VMAs.  However, the test it uses
      to do so is incorrect: it calls is_hugepage_only_range on a hugepage sized
      range at the start of the vma.  is_hugepage_only_range() will return true
      if the given range has any intersection with a hugepage address region, and
      in this case the given region need not be hugepage aligned.  So, for
      example, this test can return true if called on, say, a 4k VMA immediately
      preceding a (nicely aligned) hugepage VMA.
      
      At present we get away with this because the powerpc version of
      hugetlb_free_pgd_range() is just a call to free_pgd_range().  On ia64 (the
      only other arch with a non-trivial is_hugepage_only_range()) we get away
      with it for a different reason; the hugepage area is not contiguous with
      the rest of the user address space, and VMAs are not permitted in between,
      so the test can't return a false positive there.
      
      Nonetheless this should be fixed.  We do that in the patch below by
      replacing the is_hugepage_only_range() test with an explicit test of the
      VMA using is_vm_hugetlb_page().
      
      This in turn changes behaviour for platforms where is_hugepage_only_range()
      returns false always (everything except powerpc and ia64).  We address this
      by ensuring that hugetlb_free_pgd_range() is defined to be identical to
      free_pgd_range() (instead of a no-op) on everything except ia64.  Even so,
      it will prevent some otherwise possible coalescing of calls down to
      free_pgd_range().  Since this only happens for hugepage VMAs, removing this
      small optimization seems unlikely to cause any trouble.
      
      This patch causes no regressions on the libhugetlbfs testsuite - ppc64
      POWER5 (8-way), ppc64 G5 (2-way) and i386 Pentium M (UP).
      Signed-off-by: default avatarDavid Gibson <dwg@au1.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Acked-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      9da61aef
    • David Gibson's avatar
      [PATCH] hugepage: Make {alloc,free}_huge_page() local · 27a85ef1
      David Gibson authored
      
      
      Originally, mm/hugetlb.c just handled the hugepage physical allocation path
      and its {alloc,free}_huge_page() functions were used from the arch specific
      hugepage code.  These days those functions are only used with mm/hugetlb.c
      itself.  Therefore, this patch makes them static and removes their
      prototypes from hugetlb.h.  This requires a small rearrangement of code in
      mm/hugetlb.c to avoid a forward declaration.
      
      This patch causes no regressions on the libhugetlbfs testsuite (ppc64,
      POWER5).
      Signed-off-by: default avatarDavid Gibson <dwg@au1.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      27a85ef1
    • David Gibson's avatar
      [PATCH] hugepage: Strict page reservation for hugepage inodes · b45b5bd6
      David Gibson authored
      
      
      These days, hugepages are demand-allocated at first fault time.  There's a
      somewhat dubious (and racy) heuristic when making a new mmap() to check if
      there are enough available hugepages to fully satisfy that mapping.
      
      A particularly obvious case where the heuristic breaks down is where a
      process maps its hugepages not as a single chunk, but as a bunch of
      individually mmap()ed (or shmat()ed) blocks without touching and
      instantiating the pages in between allocations.  In this case the size of
      each block is compared against the total number of available hugepages.
      It's thus easy for the process to become overcommitted, because each block
      mapping will succeed, although the total number of hugepages required by
      all blocks exceeds the number available.  In particular, this defeats such
      a program which will detect a mapping failure and adjust its hugepage usage
      downward accordingly.
      
      The patch below addresses this problem, by strictly reserving a number of
      physical hugepages for hugepage inodes which have been mapped, but not
      instatiated.  MAP_SHARED mappings are thus "safe" - they will fail on
      mmap(), not later with an OOM SIGKILL.  MAP_PRIVATE mappings can still
      trigger an OOM.  (Actually SHARED mappings can technically still OOM, but
      only if the sysadmin explicitly reduces the hugepage pool between mapping
      and instantiation)
      
      This patch appears to address the problem at hand - it allows DB2 to start
      correctly, for instance, which previously suffered the failure described
      above.
      
      This patch causes no regressions on the libhugetblfs testsuite, and makes a
      test (designed to catch this problem) pass which previously failed (ppc64,
      POWER5).
      Signed-off-by: default avatarDavid Gibson <dwg@au1.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      b45b5bd6
    • Zhang, Yanmin's avatar
      [PATCH] Enable mprotect on huge pages · 8f860591
      Zhang, Yanmin authored
      
      
      2.6.16-rc3 uses hugetlb on-demand paging, but it doesn_t support hugetlb
      mprotect.
      
      From: David Gibson <david@gibson.dropbear.id.au>
      
        Remove a test from the mprotect() path which checks that the mprotect()ed
        range on a hugepage VMA is hugepage aligned (yes, really, the sense of
        is_aligned_hugepage_range() is the opposite of what you'd guess :-/).
      
        In fact, we don't need this test.  If the given addresses match the
        beginning/end of a hugepage VMA they must already be suitably aligned.  If
        they don't, then mprotect_fixup() will attempt to split the VMA.  The very
        first test in split_vma() will check for a badly aligned address on a
        hugepage VMA and return -EINVAL if necessary.
      
      From: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
      
        On i386 and x86-64, pte flag _PAGE_PSE collides with _PAGE_PROTNONE.  The
        identify of hugetlb pte is lost when changing page protection via mprotect.
        A page fault occurs later will trigger a bug check in huge_pte_alloc().
      
        The fix is to always make new pte a hugetlb pte and also to clean up
        legacy code where _PAGE_PRESENT is forced on in the pre-faulting day.
      Signed-off-by: default avatarZhang Yanmin <yanmin.zhang@intel.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: default avatarKen Chen <kenneth.w.chen@intel.com>
      Signed-off-by: default avatarNishanth Aravamudan <nacc@us.ibm.com>
      Cc: Andi Kleen <ak@muc.de>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      8f860591
  23. 06 Jan, 2006 1 commit
  24. 14 Nov, 2005 1 commit
    • Robin Holt's avatar
      [PATCH] mm: ZAP_BLOCK causes redundant work · 51c6f666
      Robin Holt authored
      
      
      The address based work estimate for unmapping (for lockbreak) is and always
      was horribly inefficient for sparse mappings.  The problem is most simply
      explained with an example:
      
      If we find a pgd is clear, we still have to call into unmap_page_range
      PGDIR_SIZE / ZAP_BLOCK_SIZE times, each time checking the clear pgd, in
      order to progress the working address to the next pgd.
      
      The fundamental way to solve the problem is to keep track of the end
      address we've processed and pass it back to the higher layers.
      
      From: Nick Piggin <npiggin@suse.de>
      
        Modification to completely get away from address based work estimate
        and instead use an abstract count, with a very small cost for empty
        entries as opposed to present pages.
      
        On 2.6.14-git2, ppc64, and CONFIG_PREEMPT=y, mapping and unmapping 1TB
        of virtual address space takes 1.69s; with the following patch applied,
        this operation can be done 1000 times in less than 0.01s
      
      From: Andrew Morton <akpm@osdl.org>
      
      With CONFIG_HUTETLB_PAGE=n:
      
      mm/memory.c: In function `unmap_vmas':
      mm/memory.c:779: warning: division by zero
      
      Due to
      
      			zap_work -= (end - start) /
      					(HPAGE_SIZE / PAGE_SIZE);
      
      So make the dummy HPAGE_SIZE non-zero
      Signed-off-by: default avatarRobin Holt <holt@sgi.com>
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      51c6f666
  25. 30 Oct, 2005 1 commit
    • Hugh Dickins's avatar
      [PATCH] mm: unmap_vmas with inner ptlock · 508034a3
      Hugh Dickins authored
      
      
      Remove the page_table_lock from around the calls to unmap_vmas, and replace
      the pte_offset_map in zap_pte_range by pte_offset_map_lock: all callers are
      now safe to descend without page_table_lock.
      
      Don't attempt fancy locking for hugepages, just take page_table_lock in
      unmap_hugepage_range.  Which makes zap_hugepage_range, and the hugetlb test in
      zap_page_range, redundant: unmap_vmas calls unmap_hugepage_range anyway.  Nor
      does unmap_vmas have much use for its mm arg now.
      
      The tlb_start_vma and tlb_end_vma in unmap_page_range are now called without
      page_table_lock: if they're implemented at all, they typically come down to
      flush_cache_range (usually done outside page_table_lock) and flush_tlb_range
      (which we already audited for the mprotect case).
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      508034a3