Skip to content
  • Davidlohr Bueso's avatar
    mm, hugetlb: improve page-fault scalability · 8382d914
    Davidlohr Bueso authored
    
    
    The kernel can currently only handle a single hugetlb page fault at a
    time.  This is due to a single mutex that serializes the entire path.
    This lock protects from spurious OOM errors under conditions of low
    availability of free hugepages.  This problem is specific to hugepages,
    because it is normal to want to use every single hugepage in the system
    - with normal pages we simply assume there will always be a few spare
    pages which can be used temporarily until the race is resolved.
    
    Address this problem by using a table of mutexes, allowing a better
    chance of parallelization, where each hugepage is individually
    serialized.  The hash key is selected depending on the mapping type.
    For shared ones it consists of the address space and file offset being
    faulted; while for private ones the mm and virtual address are used.
    The size of the table is selected based on a compromise of collisions
    and memory footprint of a series of database workloads.
    
    Large database workloads that make heavy use of hugepages can be
    particularly exposed to this issue, causing start-up times to be
    painfully slow.  This patch reduces the startup time of a 10 Gb Oracle
    DB (with ~5000 faults) from 37.5 secs to 25.7 secs.  Larger workloads
    will naturally benefit even more.
    
    NOTE:
    The only downside to this patch, detected by Joonsoo Kim, is that a
    small race is possible in private mappings: A child process (with its
    own mm, after cow) can instantiate a page that is already being handled
    by the parent in a cow fault.  When low on pages, can trigger spurious
    OOMs.  I have not been able to think of a efficient way of handling
    this...  but do we really care about such a tiny window? We already
    maintain another theoretical race with normal pages.  If not, one
    possible way to is to maintain the single hash for private mappings --
    any workloads that *really* suffer from this scaling problem should
    already use shared mappings.
    
    [akpm@linux-foundation.org: remove stray + characters, go BUG if hugetlb_init() kmalloc fails]
    Signed-off-by: default avatarDavidlohr Bueso <davidlohr@hp.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
    Cc: David Gibson <david@gibson.dropbear.id.au>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
    Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
    Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    8382d914