  1. Mar 06, 2019
    • mm/memfd: add an F_SEAL_FUTURE_WRITE seal to memfd · ab3948f5
      Joel Fernandes (Google) authored
      Android uses ashmem for sharing memory regions.  We are looking to
      migrate all use cases of ashmem to memfd so that we can possibly
      remove the ashmem driver from staging in the future, while also
      benefiting from using memfd and contributing to it.  Note that
      staging drivers are not ABI and can generally be removed at any time.
      
      One of the main use cases Android has is the ability to create a
      region and mmap it as writable, then add protection against making
      any "future" writes while keeping the existing already-mmap'ed
      writable region active.  This allows us to implement a use case where
      receivers of the shared memory buffer can get a read-only view, while
      the sender continues to write to the buffer.  See CursorWindow
      documentation in Android for more details:
      
        https://developer.android.com/reference/android/database/CursorWindow
      
      This use case cannot be implemented with the existing F_SEAL_WRITE
      seal.  To support it, this patch adds a new F_SEAL_FUTURE_WRITE seal
      which prevents any future mmap and write syscalls from succeeding while
      keeping the existing mmap active.
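
      For illustration, a minimal userspace sketch of that flow (assumes a
      kernel and glibc that expose memfd_create(2); F_SEAL_FUTURE_WRITE is
      defined as a fallback for older headers, and error handling is
      trimmed):

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <unistd.h>

        #ifndef F_SEAL_FUTURE_WRITE
        #define F_SEAL_FUTURE_WRITE 0x0010  /* from uapi linux/fcntl.h */
        #endif

        int main(void)
        {
                int fd = memfd_create("buf", MFD_ALLOW_SEALING);
                ftruncate(fd, 4096);

                /* Sender maps the region writable before sealing. */
                char *map = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                 MAP_SHARED, fd, 0);

                /* Seal against any future writable mmap or write(2)... */
                fcntl(fd, F_ADD_SEALS, F_SEAL_FUTURE_WRITE);

                /* ...the existing writable mapping keeps working... */
                strcpy(map, "sender can still write");

                /* ...but a new writable mapping now fails with EPERM. */
                if (mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0) == MAP_FAILED)
                        perror("mmap(PROT_WRITE) after seal");
                return 0;
        }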
      
      A better way to implement the F_SEAL_FUTURE_WRITE seal was discussed
      [1] last week, one that does not need to modify core VFS structures
      to get the same seal behavior.  This solves several side effects
      pointed out by Andy.  Self-tests are provided in a later patch to
      verify the expected semantics.
      
      [1] https://lore.kernel.org/lkml/20181111173650.GA256781@google.com/
      
      Thanks a lot to Andy for suggestions to improve code.
      
      Link: http://lkml.kernel.org/r/20190112203816.85534-2-joel@joelfernandes.org
      
      
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Acked-by: John Stultz <john.stultz@linaro.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: J. Bruce Fields <bfields@fieldses.org>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Marc-André Lureau <marcandre.lureau@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: rid swapoff of quadratic complexity · b56a2d8a
      Vineeth Remanan Pillai authored
      This patch was initially posted by Kelley Nielsen.  Reposting the patch
      with all review comments addressed and with minor modifications and
      optimizations.  Also, folding in the fixes offered by Hugh Dickins and
      Huang Ying.  Tests were rerun and commit message updated with new
      results.
      
      try_to_unuse() is of quadratic complexity, with a lot of wasted effort.
      It unuses swap entries one by one, potentially iterating over all the
      page tables for all the processes in the system for each one.
      
      The new implementation of try_to_unuse() reduces its complexity to
      linear.  It iterates over the system's mms once, unusing
      all the affected entries as it walks each set of page tables.  It also
      makes similar changes to shmem_unuse.
      
      Improvement
      
      swapoff was called on a swap partition containing about 6G of data,
      in a VM (8 CPUs, 16G RAM), and calls to unuse_pte_range() were
      counted.
      
      Present implementation....about 1200M calls (8 min, avg 80% CPU util)
      Prototype.................about  9.0K calls (3 min, avg  5% CPU util)
      
      Details
      
      In shmem_unuse(), iterate over the shmem_swaplist and, for each
      shmem_inode_info that contains a swap entry, pass it to
      shmem_unuse_inode(), along with the swap type.  In shmem_unuse_inode(),
      iterate over its associated xarray, and store the index and value of
      each swap entry in an array for passing to shmem_swapin_page() outside
      of the RCU critical section.
      
      In try_to_unuse(), instead of iterating over the entries in the type and
      unusing them one by one, perhaps walking all the page tables for all the
      processes for each one, iterate over the mmlist, making one pass.  Pass
      each mm to unuse_mm() to begin its page table walk, and during the walk,
      unuse all the ptes that have backing store in the swap type received by
      try_to_unuse().  After the walk, check the type for orphaned swap
      entries with find_next_to_unuse(), and remove them from the swap cache.
      If find_next_to_unuse() starts over at the beginning of the type, repeat
      the check of the shmem_swaplist and the walk a maximum of three times.
      
      Change unuse_mm() and the intervening walk functions down to
      unuse_pte_range() to take the type as a parameter, and to iterate over
      their entire range, calling the next function down on every iteration.
      In unuse_pte_range(), make a swap entry from each pte in the range using
      the passed in type.  If it has backing store in the type, call
      swapin_readahead() to retrieve the page and pass it to unuse_pte().
      
      Pass the count of pages_to_unuse down the page table walks in
      try_to_unuse(), and return from the walk when the desired number of
      pages has been swapped back in.
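
      In outline, a pseudocode sketch of the flow described above (not the
      actual kernel code; locking, refcounting and the pages_to_unuse
      plumbing are omitted, and swap_cache_scan_wrapped() stands in for
      the find_next_to_unuse() bookkeeping):

        /* Pseudocode sketch of the linear pass. */
        int try_to_unuse(unsigned int type)
        {
                unsigned int retries = 0;
                struct mm_struct *mm;
        retry:
                /* Shmem first: one scan of the shmem_swaplist. */
                shmem_unuse(type);

                /* One linear pass over every mm: each page-table walk
                 * unuses all ptes backed by 'type' that it meets
                 * (unuse_mm -> ... -> unuse_pte_range -> unuse_pte). */
                list_for_each_entry(mm, &init_mm.mmlist, mmlist)
                        unuse_mm(mm, type);

                /* Drop orphaned swap-cache-only entries; if the scan
                 * wrapped back to the start of the type, go around
                 * again, at most three times. */
                if (swap_cache_scan_wrapped(type) && ++retries < 3)
                        goto retry;
                return 0;
        }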
      
      Link: http://lkml.kernel.org/r/20190114153129.4852-2-vpillai@digitalocean.com
      
      
      Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
      Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com>
      Signed-off-by: Huang Ying <ying.huang@intel.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: refactor swap-in logic out of shmem_getpage_gfp · c5bf121e
      Vineeth Remanan Pillai authored
      The swap-in logic can be reused independently of the rest of the
      logic in shmem_getpage_gfp, so let's refactor it out into an
      independent function.
      
      Link: http://lkml.kernel.org/r/20190114153129.4852-1-vpillai@digitalocean.com
      
      
      Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kelley Nielsen <kelleynnn@gmail.com>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. Dec 08, 2018
    • Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask" · 356ff8a9
      David Rientjes authored
      
      This reverts commit 89c83fb5.
      
      This should have been done as part of 2f0799a0 ("mm, thp: restore
      node-local hugepage allocations").  The movement of the thp allocation
      policy from alloc_pages_vma() to alloc_hugepage_direct_gfpmask() was
      intended to only set __GFP_THISNODE for mempolicies that are not
      MPOL_BIND whereas the revert could set this regardless of mempolicy.
      
      While the check for MPOL_BIND between alloc_hugepage_direct_gfpmask()
      and alloc_pages_vma() was racy, that has been removed since the
      revert.  What is left is the possibility of using __GFP_THISNODE in
      policy_node() when it is unexpected, because the special handling for
      hugepages in alloc_pages_vma() was removed as part of the consolidation.
      
      Secondly, prior to 89c83fb5, alloc_pages_vma() implemented a somewhat
      different policy for hugepage allocations, which were allocated through
      alloc_hugepage_vma().  For hugepage allocations, if the allocating
      process's node is in the set of allowed nodes, allocate with
      __GFP_THISNODE for that node (for MPOL_PREFERRED, use that node with
      __GFP_THISNODE instead).  This was changed for shmem_alloc_hugepage() to
      allow fallback to other nodes in 89c83fb5 as it did for new_page() in
      mm/mempolicy.c which is functionally different behavior and removes the
      requirement to only allocate hugepages locally.
      
      So this commit does a full revert of 89c83fb5 instead of the partial
      revert that was done in 2f0799a0.  The result is the same thp
      allocation policy for 4.20 that was in 4.19.
      
      Fixes: 89c83fb5 ("mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask")
      Fixes: 2f0799a0 ("mm, thp: restore node-local hugepage allocations")
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. Nov 03, 2018
    • mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask · 89c83fb5
      Michal Hocko authored
      THP allocation mode is quite complex and depends on the defrag mode.
      Much of this complexity is currently hidden in
      alloc_hugepage_direct_gfpmask.  The NUMA special casing (namely
      __GFP_THISNODE) is, however, independent and currently placed in
      alloc_pages_vma.  This both adds an unnecessary branch to all
      vma-based page allocation requests and makes the code needlessly
      complex.  Not to mention that e.g. shmem THP used to do the node
      reclaiming unconditionally regardless of the defrag mode until
      recently.  This was not only unexpected behavior, it was also hardly
      a good default, and I strongly suspect it was a side effect of the
      code sharing rather than a deliberate decision, which suggests that
      such a layering is wrong.
      
      Get rid of the thp special casing from alloc_pages_vma and move the
      logic to alloc_hugepage_direct_gfpmask. __GFP_THISNODE is applied to the
      resulting gfp mask only when the direct reclaim is not requested and
      when there is no explicit numa binding to preserve the current logic.
      
      Please note that there's also a slight difference wrt MPOL_BIND now. The
      previous code would avoid using __GFP_THISNODE if the local node was
      outside of policy_nodemask(). After this patch __GFP_THISNODE is avoided
      for all MPOL_BIND policies.  So the difference is that, if the local
      node is actually allowed by the bind policy's nodemask,
      __GFP_THISNODE would previously have been added, but now it won't
      be.  From a behavior POV this is still correct because the policy
      nodemask is used.
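
      Schematically, the consolidated rule looks like this (a simplified
      sketch; defrag_mode_gfp_mask() and vma_explicit_policy() are
      illustrative stand-ins, not the actual helpers):

        gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
        {
                /* Base mask chosen by the defrag mode, as before. */
                gfp_t gfp = defrag_mode_gfp_mask();    /* illustrative */

                /* Pin to the local node only when no direct reclaim was
                 * requested and there is no explicit NUMA binding (any
                 * MPOL_BIND policy now avoids __GFP_THISNODE). */
                if (!(gfp & __GFP_DIRECT_RECLAIM) &&
                    !vma_explicit_policy(vma))         /* illustrative */
                        gfp |= __GFP_THISNODE;

                return gfp;
        }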
      
      Link: http://lkml.kernel.org/r/20180925120326.24392-3-mhocko@kernel.org
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. Sep 30, 2018
    • xarray: Replace exceptional entries · 3159f943
      Matthew Wilcox authored
      
      Introduce xarray value entries and tagged pointers to replace radix
      tree exceptional entries.  This is a slight change in encoding to allow
      the use of an extra bit (we can now store BITS_PER_LONG - 1 bits in a
      value entry).  It is also a change in emphasis; exceptional entries are
      intimidating and different.  As the comment explains, you can choose
      to store values or pointers in the xarray and they are both first-class
      citizens.
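
      For example, a short sketch using the new API (a trivial demo
      function, not code from the patch itself):

        #include <linux/xarray.h>

        static DEFINE_XARRAY(xa);

        static void xarray_value_demo(void)
        {
                void *entry;

                /* Store a small integer directly in the slot: a value
                 * entry holds up to BITS_PER_LONG - 1 bits, with no
                 * allocation needed. */
                xa_store(&xa, 0, xa_mk_value(42), GFP_KERNEL);

                /* Values and pointers come back through the same lookup. */
                entry = xa_load(&xa, 0);
                if (xa_is_value(entry))
                        pr_info("value entry: %lu\n", xa_to_value(entry));
        }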
      
      Signed-off-by: Matthew Wilcox <willy@infradead.org>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
  5. Sep 20, 2018
    • mm: shmem.c: Correctly annotate new inodes for lockdep · b45d71fb
      Joel Fernandes (Google) authored
      Directories and inodes don't necessarily need to be in the same
      lockdep class.  For example, hugetlbfs splits them out too, to
      prevent false positives in lockdep.  Annotate correctly after new
      inode creation: if it's a directory inode, it will be put into a
      different class.
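
      Sketched, the fix amounts to re-keying the inode's lock once its
      type is known (a simplified sketch, not the exact diff;
      lockdep_annotate_inode_mutex_key() keys i_rwsem by S_ISDIR):

        inode = new_inode(sb);
        if (inode) {
                inode->i_mode = mode;
                /* ... other initialization ... */

                /* Pick the lockdep class from i_mode: directory inodes
                 * land in their own class, so a shmem file lock no
                 * longer aliases a shmem directory lock. */
                lockdep_annotate_inode_mutex_key(inode);
        }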
      
      This should fix a lockdep splat reported by syzbot:
      
      > ======================================================
      > WARNING: possible circular locking dependency detected
      > 4.18.0-rc8-next-20180810+ #36 Not tainted
      > ------------------------------------------------------
      > syz-executor900/4483 is trying to acquire lock:
      > 00000000d2bfc8fe (&sb->s_type->i_mutex_key#9){++++}, at: inode_lock
      > include/linux/fs.h:765 [inline]
      > 00000000d2bfc8fe (&sb->s_type->i_mutex_key#9){++++}, at:
      > shmem_fallocate+0x18b/0x12e0 mm/shmem.c:2602
      >
      > but task is already holding lock:
      > 0000000025208078 (ashmem_mutex){+.+.}, at: ashmem_shrink_scan+0xb4/0x630
      > drivers/staging/android/ashmem.c:448
      >
      > which lock already depends on the new lock.
      >
      > -> #2 (ashmem_mutex){+.+.}:
      >        __mutex_lock_common kernel/locking/mutex.c:925 [inline]
      >        __mutex_lock+0x171/0x1700 kernel/locking/mutex.c:1073
      >        mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:1088
      >        ashmem_mmap+0x55/0x520 drivers/staging/android/ashmem.c:361
      >        call_mmap include/linux/fs.h:1844 [inline]
      >        mmap_region+0xf27/0x1c50 mm/mmap.c:1762
      >        do_mmap+0xa10/0x1220 mm/mmap.c:1535
      >        do_mmap_pgoff include/linux/mm.h:2298 [inline]
      >        vm_mmap_pgoff+0x213/0x2c0 mm/util.c:357
      >        ksys_mmap_pgoff+0x4da/0x660 mm/mmap.c:1585
      >        __do_sys_mmap arch/x86/kernel/sys_x86_64.c:100 [inline]
      >        __se_sys_mmap arch/x86/kernel/sys_x86_64.c:91 [inline]
      >        __x64_sys_mmap+0xe9/0x1b0 arch/x86/kernel/sys_x86_64.c:91
      >        do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
      >        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      >
      > -> #1 (&mm->mmap_sem){++++}:
      >        __might_fault+0x155/0x1e0 mm/memory.c:4568
      >        _copy_to_user+0x30/0x110 lib/usercopy.c:25
      >        copy_to_user include/linux/uaccess.h:155 [inline]
      >        filldir+0x1ea/0x3a0 fs/readdir.c:196
      >        dir_emit_dot include/linux/fs.h:3464 [inline]
      >        dir_emit_dots include/linux/fs.h:3475 [inline]
      >        dcache_readdir+0x13a/0x620 fs/libfs.c:193
      >        iterate_dir+0x48b/0x5d0 fs/readdir.c:51
      >        __do_sys_getdents fs/readdir.c:231 [inline]
      >        __se_sys_getdents fs/readdir.c:212 [inline]
      >        __x64_sys_getdents+0x29f/0x510 fs/readdir.c:212
      >        do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
      >        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      >
      > -> #0 (&sb->s_type->i_mutex_key#9){++++}:
      >        lock_acquire+0x1e4/0x540 kernel/locking/lockdep.c:3924
      >        down_write+0x8f/0x130 kernel/locking/rwsem.c:70
      >        inode_lock include/linux/fs.h:765 [inline]
      >        shmem_fallocate+0x18b/0x12e0 mm/shmem.c:2602
      >        ashmem_shrink_scan+0x236/0x630 drivers/staging/android/ashmem.c:455
      >        ashmem_ioctl+0x3ae/0x13a0 drivers/staging/android/ashmem.c:797
      >        vfs_ioctl fs/ioctl.c:46 [inline]
      >        file_ioctl fs/ioctl.c:501 [inline]
      >        do_vfs_ioctl+0x1de/0x1720 fs/ioctl.c:685
      >        ksys_ioctl+0xa9/0xd0 fs/ioctl.c:702
      >        __do_sys_ioctl fs/ioctl.c:709 [inline]
      >        __se_sys_ioctl fs/ioctl.c:707 [inline]
      >        __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:707
      >        do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
      >        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      >
      > other info that might help us debug this:
      >
      > Chain exists of:
      >   &sb->s_type->i_mutex_key#9 --> &mm->mmap_sem --> ashmem_mutex
      >
      >  Possible unsafe locking scenario:
      >
      >        CPU0                    CPU1
      >        ----                    ----
      >   lock(ashmem_mutex);
      >                                lock(&mm->mmap_sem);
      >                                lock(ashmem_mutex);
      >   lock(&sb->s_type->i_mutex_key#9);
      >
      >  *** DEADLOCK ***
      >
      > 1 lock held by syz-executor900/4483:
      >  #0: 0000000025208078 (ashmem_mutex){+.+.}, at:
      > ashmem_shrink_scan+0xb4/0x630 drivers/staging/android/ashmem.c:448
      
      Link: http://lkml.kernel.org/r/20180821231835.166639-1-joel@joelfernandes.org
      
      
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Reviewed-by: NeilBrown <neilb@suse.com>
      Suggested-by: NeilBrown <neilb@suse.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  6. Jun 08, 2018
    • mm/shmem.c: zero out unused vma fields in shmem_pseudo_vma_init() · daa28075
      Kirill A. Shutemov authored
      shmem/tmpfs uses pseudo vma to allocate page with correct NUMA policy.
      
      The pseudo vma doesn't have vm_page_prot set.  We are going to encode
      the encryption KeyID in vm_page_prot, and having garbage there causes
      problems.
      
      Zero out all unused fields in the pseudo vma.
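
      A sketch of the resulting helper (close to the shmem.c code, lightly
      abridged):

        static void shmem_pseudo_vma_init(struct vm_area_struct *vma,
                        struct shmem_inode_info *info, pgoff_t index)
        {
                /* Create a pseudo vma that just exists so it can be used
                 * for lookups; zeroing replaces the old piecemeal
                 * "vma->vm_start = 0; vma->vm_ops = NULL;" init, so no
                 * garbage is left in vm_page_prot or anywhere else. */
                memset(vma, 0, sizeof(*vma));
                vma->vm_pgoff = index + info->vfs_inode.i_ino;
                vma->vm_policy = mpol_shared_policy_lookup(&info->policy,
                                                           index);
        }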
      
      Link: http://lkml.kernel.org/r/20180531135602.20321-1-kirill.shutemov@linux.intel.com
      
      
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/shmem.c: use new return type vm_fault_t · 20acce67
      Souptick Joarder authored
      Use the new return type vm_fault_t for the fault handler.  For now,
      this is just documenting that the function returns a VM_FAULT value
      rather than an errno.  Once all instances are converted, vm_fault_t
      will become a distinct type.
      
      See commit 1c8f4220 ("mm: change return type to vm_fault_t")
      
      vmf_error() is the newly introduced inline function in 4.17-rc6.
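
      The conversion pattern, sketched (an illustrative shape, not the
      full shmem_fault()):

        static vm_fault_t shmem_fault(struct vm_fault *vmf)
        {
                int err = 0;

                /* ... page lookup work; failures land in 'err' as a
                 * negative errno ... */

                if (err)
                        /* -ENOMEM becomes VM_FAULT_OOM, anything else
                         * VM_FAULT_SIGBUS -- no raw errnos escape. */
                        return vmf_error(err);
                return VM_FAULT_NOPAGE;
        }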
      
      Link: http://lkml.kernel.org/r/20180521202410.GA17912@jordon-HP-15-Notebook-PC
      
      
      Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
      Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: allow decoding a file handle of an unlinked file · 12ba780d
      Amir Goldstein authored
      tmpfs uses the helper d_find_alias() to find a dentry from a decoded
      inode, but d_find_alias() skips unhashed dentries, so unlinked files
      cannot be decoded from a file handle.
      
      This can be reproduced using xfstests test program open_by_handle:
      
        $ open_by_handle -c /tmp/testdir
        $ open_by_handle -dk /tmp/testdir
        open_by_handle(/tmp/testdir/file000000) returned 116 incorrectly on an unlinked open file!
      
      To fix this, if d_find_alias() can't find a hashed alias, call
      d_find_any_alias() to return an unhashed one.
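
      The fallback, sketched (close to what the patch adds):

        static struct dentry *shmem_find_alias(struct inode *inode)
        {
                struct dentry *alias = d_find_alias(inode); /* hashed only */

                /* Unlinked-but-open files have only unhashed dentries,
                 * so fall back to any alias at all. */
                return alias ?: d_find_any_alias(inode);
        }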
      
      Link: http://lkml.kernel.org/r/CAOQ4uxg+qSLP0KwdW+h1tcPqOCQd+_pGZVXiePQB1TXCMBMctQ@mail.gmail.com
      
      
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Reviewed-by: NeilBrown <neilb@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: shmem: make stat.st_blksize return huge page size if THP is on · 89fdcd26
      Yang Shi authored
      Since tmpfs gained THP support in 4.8, hugetlbfs is no longer the
      only filesystem with huge page support.  tmpfs can use huge pages
      via THP when mounted with the "huge=" mount option.
      
      When applications use huge pages on hugetlbfs, they just need to
      check the filesystem magic number, but that is not enough for
      tmpfs.  Make stat.st_blksize return the huge page size if the
      filesystem is mounted with an appropriate "huge=" option, to give
      applications a hint to optimize their behavior for THP.
      
      Some applications may not behave wisely with THP.  For example,
      QEMU may mmap a file at a non-huge-page-aligned hint address with
      MAP_FIXED, which results in no pages being PMD-mapped even though
      THP is used.  Some applications may mmap a file at a
      non-huge-page-aligned offset.  Both behaviors make THP pointless.
      
      statfs.f_bsize still returns 4KB for tmpfs since THP could be
      split, and tmpfs may also fall back to 4KB pages silently if there
      are not enough huge pages.  Furthermore, a different f_bsize makes
      the max_blocks and free_blocks calculation harder without much
      benefit.  Returning the huge page size via stat.st_blksize sounds
      good enough.
      
      Since PUD-size huge pages are not yet supported by THP, it just
      returns HPAGE_PMD_SIZE for now.
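
      From userspace the hint is then visible via stat(2); a minimal
      check (assumes a tmpfs mounted with an appropriate "huge=" option
      at the example path):

        #include <stdio.h>
        #include <sys/stat.h>

        int main(void)
        {
                struct stat st;

                if (stat("/mnt/tmpfs-huge/file", &st) == 0)
                        /* 2097152 (2M, HPAGE_PMD_SIZE on x86_64) hints
                         * that THP is in use; 4096 means regular pages. */
                        printf("st_blksize = %ld\n", (long)st.st_blksize);
                return 0;
        }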
      
      Hugh said:
      
      : Sorry, I have no enthusiasm for this patch; but do I feel strongly
      : enough to override you and everyone else to NAK it?  No, I don't feel
      : that strongly, maybe st_blksize isn't worth arguing over.
      :
      : We did look at struct stat when designing huge tmpfs, to see if there
      : were any fields that should be adjusted for it; but concluded none.
      : Yes, it would sometimes be nice to have a quickly accessible indicator
      : for when tmpfs has been mounted huge (scanning /proc/mounts for options
      : can be tiresome, agreed); but since tmpfs tries to supply huge (or not)
      : pages transparently, no difference seemed right.
      :
      : So, because st_blksize is a not very useful field of struct stat, with
      : "size" in the name, we're going to put HPAGE_PMD_SIZE in there instead
      : of PAGE_SIZE, if the tmpfs was mounted with one of the huge "huge"
      : options (force or always, okay; within_size or advise, not so much).
      : Though HPAGE_PMD_SIZE is no more its "preferred I/O size" or "blocksize
      : for file system I/O" than PAGE_SIZE was.
      :
      : Which we can expect to speed up some applications and disadvantage
      : others, depending on how they interpret st_blksize: just like if we
      : changed it in the same way on non-huge tmpfs.  (Did I actually try
      : changing st_blksize early on, and find it broke something?  If so, I've
      : now forgotten what, and a search through commit messages didn't find
      : it; but I guess we'll find out soon enough.)
      :
      : If there were an mstat() syscall, returning a field "preferred
      : alignment", then we could certainly agree to put HPAGE_PMD_SIZE in
      : there; but in stat()'s st_blksize?  And what happens when (in future)
      : mm maps this or that hard-disk filesystem's blocks with a pmd mapping -
      : should that filesystem then advertise a bigger st_blksize, despite the
      : same disk layout as before?  What happens with DAX?
      :
      : And this change is not going to help the QEMU suboptimality that
      : brought you here (or does QEMU align mmaps according to st_blksize?).
      : QEMU ought to work well with kernels without this change, and kernels
      : with this change; and I hope it can easily deal with both by avoiding
      : that use of MAP_FIXED which prevented the kernel's intended alignment.
      
      [akpm@linux-foundation.org: remove unneeded `else']
      Link: http://lkml.kernel.org/r/1524665633-83806-1-git-send-email-yang.shi@linux.alibaba.com
      
      
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Suggested-by: Christoph Hellwig <hch@infradead.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>