1. 13 Dec, 2016 8 commits
  2. 06 Dec, 2016 1 commit
    • Linus Torvalds's avatar
      shmem: fix shm fallocate() list corruption · 10d20bd2
      Linus Torvalds authored
      The shmem hole punching with fallocate(FALLOC_FL_PUNCH_HOLE) does not
      want to race with generating new pages by faulting them in.
      
      However, the wait-queue used to delay the page faulting has a serious
      problem: the wait queue head (in shmem_fallocate()) is allocated on the
      stack, and the code expects that "wake_up_all()" will make sure that all
      the queue entries are gone before the stack frame is de-allocated.
      
      And that is not at all necessarily the case.
      
      Yes, a normal wake-up sequence will remove the wait-queue entry that
      caused the wakeup (see "autoremove_wake_function()"), but the key
      wording there is "that caused the wakeup".  When there are multiple
      possible wakeup sources, the wait queue entry may well stay around.
      
      And _particularly_ in a page fault path, we may be faulting in new pages
      from user space while we also have other things going on, and there may
      well be other pending wakeups.
      
      So despite the "wake_up_all()", it's not at all guaranteed that all list
      entries are removed from the wait queue head on the stack.
      
      Fix this by introducing a new wakeup function that removes the list
      entry unconditionally, even if the target process had already woken up
      for other reasons.  Use that "synchronous" function to set up the
      waiters in shmem_fault().
      
      This problem has never been seen in the wild afaik, but Dave Jones has
      reported it on and off while running trinity.  We thought we fixed the
      stack corruption with the blk-mq rq_list locking fix (commit
      7fe31130
      
      : "blk-mq: update hardware and software queues for sleeping
      alloc"), but it turns out there was _another_ stack corruptor hiding
      in the trinity runs.
      
      Vegard Nossum (also running trinity) was able to trigger this one fairly
      consistently, and made us look once again at the shmem code due to the
      faults often being in that area.
      
      Reported-and-tested-by: Vegard Nossum <vegard.nossum@oracle.com>.
      Reported-by: default avatarDave Jones <davej@codemonkey.org.uk>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      10d20bd2
  3. 03 Dec, 2016 2 commits
    • Michal Hocko's avatar
      mm, vmscan: add cond_resched() into shrink_node_memcg() · bd041733
      Michal Hocko authored
      Boris Zhmurov has reported RCU stalls during the kswapd reclaim:
      
        INFO: rcu_sched detected stalls on CPUs/tasks:
         23-...: (22 ticks this GP) idle=92f/140000000000000/0 softirq=2638404/2638404 fqs=23
         (detected by 4, t=6389 jiffies, g=786259, c=786258, q=42115)
        Task dump for CPU 23:
        kswapd1         R  running task        0   148      2 0x00000008
        Call Trace:
          shrink_node+0xd2/0x2f0
          kswapd+0x2cb/0x6a0
          mem_cgroup_shrink_node+0x160/0x160
          kthread+0xbd/0xe0
          __switch_to+0x1fa/0x5c0
          ret_from_fork+0x1f/0x40
          kthread_create_on_node+0x180/0x180
      
      a closer code inspection has shown that we might indeed miss all the
      scheduling points in the reclaim path if no pages can be isolated from
      the LRU list.  This is a pathological case but other reports from Donald
      Buczek have shown that we might indeed hit such a path:
      
              clusterd-989   [009] .... 118023.654491: mm_vmscan_direct_reclaim_end: nr_reclaimed=193
               kswapd1-86    [001] dN.. 118023.987475: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239830 nr_taken=0 file=1
               kswapd1-86    [001] dN.. 118024.320968: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239844 nr_taken=0 file=1
               kswapd1-86    [001] dN.. 118024.654375: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239858 nr_taken=0 file=1
               kswapd1-86    [001] dN.. 118024.987036: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239872 nr_taken=0 file=1
               kswapd1-86    [001] dN.. 118025.319651: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239886 nr_taken=0 file=1
               kswapd1-86    [001] dN.. 118025.652248: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239900 nr_taken=0 file=1
               kswapd1-86    [001] dN.. 118025.984870: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239914 nr_taken=0 file=1
        [...]
               kswapd1-86    [001] dN.. 118084.274403: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4241133 nr_taken=0 file=1
      
      this is minute long snapshot which didn't take a single page from the
      LRU.  It is not entirely clear why only 1303 pages have been scanned
      during that time (maybe there was a heavy IRQ activity interfering).
      
      In any case it looks like we can really hit long periods without
      scheduling on non preemptive kernels so an explicit cond_resched() in
      shrink_node_memcg which is independent on the reclaim operation is due.
      
      Link: http://lkml.kernel.org/r/20161202095841.16648-1-mhocko@kernel.org
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reported-by: default avatarBoris Zhmurov <bb@kernelpanic.ru>
      Tested-by: default avatarBoris Zhmurov <bb@kernelpanic.ru>
      Reported-by: default avatarDonald Buczek <buczek@molgen.mpg.de>
      Reported-by: default avatar"Christopher S. Aker" <caker@theshore.net>
      Reported-by: default avatarPaul Menzel <pmenzel@molgen.mpg.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bd041733
    • Michal Hocko's avatar
      mm: workingset: fix NULL ptr in count_shadow_nodes · 20ab67a5
      Michal Hocko authored
      Commit 0a6b76dd ("mm: workingset: make shadow node shrinker memcg
      aware") has made the workingset shadow nodes shrinker memcg aware.  The
      implementation is not correct though because memcg_kmem_enabled() might
      become true while we are doing a global reclaim when the sc->memcg might
      be NULL which is exactly what Marek has seen:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000400
        IP: [<ffffffff8122d520>] mem_cgroup_node_nr_lru_pages+0x20/0x40
        PGD 0
        Oops: 0000 [#1] SMP
        CPU: 0 PID: 60 Comm: kswapd0 Tainted: G           O   4.8.10-12.pvops.qubes.x86_64 #1
        task: ffff880011863b00 task.stack: ffff880011868000
        RIP: mem_cgroup_node_nr_lru_pages+0x20/0x40
        RSP: e02b:ffff88001186bc70  EFLAGS: 00010293
        RAX: 0000000000000000 RBX: ffff88001186bd20 RCX: 0000000000000002
        RDX: 000000000000000c RSI: 0000000000000000 RDI: 0000000000000000
        RBP: ffff88001186bc70 R08: 28f5c28f5c28f5c3 R09: 0000000000000000
        R10: 0000000000006c34 R11: 0000000000000333 R12: 00000000000001f6
        R13: ffffffff81c6f6a0 R14: 0000000000000000 R15: 0000000000000000
        FS:  0000000000000000(0000) GS:ffff880013c00000(0000) knlGS:ffff880013d00000
        CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000400 CR3: 00000000122f2000 CR4: 0000000000042660
        Call Trace:
          count_shadow_nodes+0x9a/0xa0
          shrink_slab.part.42+0x119/0x3e0
          shrink_node+0x22c/0x320
          kswapd+0x32c/0x700
          kthread+0xd8/0xf0
          ret_from_fork+0x1f/0x40
        Code: 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 3b 35 dd eb b1 00 55 48 89 e5 73 2c 89 d2 31 c9 31 c0 4c 63 ce 48 0f a3 ca 73 13 <4a> 8b b4 cf 00 04 00 00 41 89 c8 4a 03 84 c6 80 00 00 00 83 c1
        RIP  mem_cgroup_node_nr_lru_pages+0x20/0x40
         RSP <ffff88001186bc70>
        CR2: 0000000000000400
        ---[ end trace 100494b9edbdfc4d ]---
      
      This patch fixes the issue by checking sc->memcg rather than
      memcg_kmem_enabled() which is sufficient because shrink_slab makes sure
      that only memcg aware shrinkers will get non-NULL memcgs and only if
      memcg_kmem_enabled is true.
      
      Fixes: 0a6b76dd ("mm: workingset: make shadow node shrinker memcg aware")
      Link: http://lkml.kernel.org/r/20161201132156.21450-1-mhocko@kernel.org
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reported-by: default avatarMarek Marczykowski-Górecki <marmarek@mimuw.edu.pl>
      Tested-by: default avatarMarek Marczykowski-Górecki <marmarek@mimuw.edu.pl>
      Acked-by: default avatarVladimir Davydov <vdavydov.dev@gmail.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarBalbir Singh <bsingharora@gmail.com>
      Cc: <stable@vger.kernel.org>	[4.6+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      20ab67a5
  4. 01 Dec, 2016 5 commits
  5. 29 Nov, 2016 1 commit
  6. 17 Nov, 2016 1 commit
    • Aaron Lu's avatar
      mremap: fix race between mremap() and page cleanning · 5d190420
      Aaron Lu authored
      Prior to 3.15, there was a race between zap_pte_range() and
      page_mkclean() where writes to a page could be lost.  Dave Hansen
      discovered by inspection that there is a similar race between
      move_ptes() and page_mkclean().
      
      We've been able to reproduce the issue by enlarging the race window with
      a msleep(), but have not been able to hit it without modifying the code.
      So, we think it's a real issue, but is difficult or impossible to hit in
      practice.
      
      The zap_pte_range() issue is fixed by commit 1cf35d47
      
      ("mm: split
      'tlb_flush_mmu()' into tlb flushing and memory freeing parts").  And
      this patch is to fix the race between page_mkclean() and mremap().
      
      Here is one possible way to hit the race: suppose a process mmapped a
      file with READ | WRITE and SHARED, it has two threads and they are bound
      to 2 different CPUs, e.g.  CPU1 and CPU2.  mmap returned X, then thread
      1 did a write to addr X so that CPU1 now has a writable TLB for addr X
      on it.  Thread 2 starts mremaping from addr X to Y while thread 1
      cleaned the page and then did another write to the old addr X again.
      The 2nd write from thread 1 could succeed but the value will get lost.
      
              thread 1                           thread 2
           (bound to CPU1)                    (bound to CPU2)
      
        1: write 1 to addr X to get a
           writeable TLB on this CPU
      
                                              2: mremap starts
      
                                              3: move_ptes emptied PTE for addr X
                                                 and setup new PTE for addr Y and
                                                 then dropped PTL for X and Y
      
        4: page laundering for N by doing
           fadvise FADV_DONTNEED. When done,
           pageframe N is deemed clean.
      
        5: *write 2 to addr X
      
                                              6: tlb flush for addr X
      
        7: munmap (Y, pagesize) to make the
           page unmapped
      
        8: fadvise with FADV_DONTNEED again
           to kick the page off the pagecache
      
        9: pread the page from file to verify
           the value. If 1 is there, it means
           we have lost the written 2.
      
        *the write may or may not cause segmentation fault, it depends on
        if the TLB is still on the CPU.
      
      Please note that this is only one specific way of how the race could
      occur, it didn't mean that the race could only occur in exact the above
      config, e.g. more than 2 threads could be involved and fadvise() could
      be done in another thread, etc.
      
      For anonymous pages, they could race between mremap() and page reclaim:
      THP: a huge PMD is moved by mremap to a new huge PMD, then the new huge
      PMD gets unmapped/splitted/pagedout before the flush tlb happened for
      the old huge PMD in move_page_tables() and we could still write data to
      it.  The normal anonymous page has similar situation.
      
      To fix this, check for any dirty PTE in move_ptes()/move_huge_pmd() and
      if any, did the flush before dropping the PTL.  If we did the flush for
      every move_ptes()/move_huge_pmd() call then we do not need to do the
      flush in move_pages_tables() for the whole range.  But if we didn't, we
      still need to do the whole range flush.
      
      Alternatively, we can track which part of the range is flushed in
      move_ptes()/move_huge_pmd() and which didn't to avoid flushing the whole
      range in move_page_tables().  But that would require multiple tlb
      flushes for the different sub-ranges and should be less efficient than
      the single whole range flush.
      
      KBuild test on my Sandybridge desktop doesn't show any noticeable change.
      v4.9-rc4:
        real    5m14.048s
        user    32m19.800s
        sys     4m50.320s
      
      With this commit:
        real    5m13.888s
        user    32m19.330s
        sys     4m51.200s
      Reported-by: default avatarDave Hansen <dave.hansen@intel.com>
      Signed-off-by: default avatarAaron Lu <aaron.lu@intel.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5d190420
  7. 11 Nov, 2016 9 commits
  8. 09 Nov, 2016 1 commit
  9. 06 Nov, 2016 1 commit
    • Eryu Guan's avatar
      mm/filemap: don't allow partially uptodate page for pipes · 6d6d36bc
      Eryu Guan authored
      Starting from 4.9-rc1 kernel, I started noticing some test failures
      of sendfile(2) and splice(2) (sendfile0N and splice01 from LTP) when
      testing on sub-page block size filesystems (tested both XFS and
      ext4), these syscalls start to return EIO in the tests. e.g.
      
      sendfile02    1  TFAIL  :  sendfile02.c:133: sendfile(2) failed to return expected value, expected: 26, got: -1
      sendfile02    2  TFAIL  :  sendfile02.c:133: sendfile(2) failed to return expected value, expected: 24, got: -1
      sendfile02    3  TFAIL  :  sendfile02.c:133: sendfile(2) failed to return expected value, expected: 22, got: -1
      sendfile02    4  TFAIL  :  sendfile02.c:133: sendfile(2) failed to return expected value, expected: 20, got: -1
      
      This is because that in sub-page block size cases, we don't need the
      whole page to be uptodate, only the part we care about is uptodate
      is OK (if fs has ->is_partially_uptodate defined). But
      page_cache_pipe_buf_confirm() doesn't have the ability to check the
      partially-uptodate case, it needs the whole page to be uptodate. So
      it returns EIO in this case.
      
      This is a regression introduced by commit 82c156f8
      
       ("switch
      generic_file_splice_read() to use of ->read_iter()"). Prior to the
      change, generic_file_splice_read() doesn't allow partially-uptodate
      page either, so it worked fine.
      
      Fix it by skipping the partially-uptodate check if we're working on
      a pipe in do_generic_file_read(), so we read the whole page from
      disk as long as the page is not uptodate.
      Signed-off-by: default avatarEryu Guan <guaneryu@gmail.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      6d6d36bc
  10. 31 Oct, 2016 1 commit
    • Kees Cook's avatar
      latent_entropy: Fix wrong gcc code generation with 64 bit variables · 58bea414
      Kees Cook authored
      
      
      The stack frame size could grow too large when the plugin used long long
      on 32-bit architectures when the given function had too many basic blocks.
      
      The gcc warning was:
      
      drivers/pci/hotplug/ibmphp_ebda.c: In function 'ibmphp_access_ebda':
      drivers/pci/hotplug/ibmphp_ebda.c:409:1: warning: the frame size of 1108 bytes is larger than 1024 bytes [-Wframe-larger-than=]
      
      This switches latent_entropy from u64 to unsigned long.
      
      Thanks to PaX Team and Emese Revfy for the patch.
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      58bea414
  11. 28 Oct, 2016 6 commits
  12. 27 Oct, 2016 3 commits
    • Linus Torvalds's avatar
      Allow KASAN and HOTPLUG_MEMORY to co-exist when doing build testing · 67463e54
      Linus Torvalds authored
      No, KASAN may not be able to co-exist with HOTPLUG_MEMORY at runtime,
      but for build testing there is no reason not to allow them together.
      
      This hopefully means better build coverage and fewer embarrasing silly
      problems like the one fixed by commit 9db4f36e
      
       ("mm: remove unused
      variable in memory hotplug") in the future.
      
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      67463e54
    • Linus Torvalds's avatar
      mm: remove unused variable in memory hotplug · 9db4f36e
      Linus Torvalds authored
      When I removed the per-zone bitlock hashed waitqueues in commit
      9dcb8b68
      
       ("mm: remove per-zone hashtable of bitlock waitqueues"), I
      removed all the magic hotplug memory initialization of said waitqueues
      too.
      
      But when I actually _tested_ the resulting build, I stupidly assumed
      that "allmodconfig" would enable memory hotplug.  And it doesn't,
      because it enables KASAN instead, which then disables hotplug memory
      support.
      
      As a result, my build test of the per-zone waitqueues was totally
      broken, and I didn't notice that the compiler warns about the now unused
      iterator variable 'i'.
      
      I guess I should be happy that that seems to be the worst breakage from
      my clearly horribly failed test coverage.
      Reported-by: default avatarStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9db4f36e
    • Linus Torvalds's avatar
      mm: remove per-zone hashtable of bitlock waitqueues · 9dcb8b68
      Linus Torvalds authored
      
      
      The per-zone waitqueues exist because of a scalability issue with the
      page waitqueues on some NUMA machines, but it turns out that they hurt
      normal loads, and now with the vmalloced stacks they also end up
      breaking gfs2 that uses a bit_wait on a stack object:
      
           wait_on_bit(&gh->gh_iflags, HIF_WAIT, TASK_UNINTERRUPTIBLE)
      
      where 'gh' can be a reference to the local variable 'mount_gh' on the
      stack of fill_super().
      
      The reason the per-zone hash table breaks for this case is that there is
      no "zone" for virtual allocations, and trying to look up the physical
      page to get at it will fail (with a BUG_ON()).
      
      It turns out that I actually complained to the mm people about the
      per-zone hash table for another reason just a month ago: the zone lookup
      also hurts the regular use of "unlock_page()" a lot, because the zone
      lookup ends up forcing several unnecessary cache misses and generates
      horrible code.
      
      As part of that earlier discussion, we had a much better solution for
      the NUMA scalability issue - by just making the page lock have a
      separate contention bit, the waitqueue doesn't even have to be looked at
      for the normal case.
      
      Peter Zijlstra already has a patch for that, but let's see if anybody
      even notices.  In the meantime, let's fix the actual gfs2 breakage by
      simplifying the bitlock waitqueues and removing the per-zone issue.
      Reported-by: default avatarAndreas Gruenbacher <agruenba@redhat.com>
      Tested-by: default avatarBob Peterson <rpeterso@redhat.com>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9dcb8b68
  13. 25 Oct, 2016 1 commit
    • Lorenzo Stoakes's avatar
      mm: unexport __get_user_pages() · 0d731759
      Lorenzo Stoakes authored
      
      
      This patch unexports the low-level __get_user_pages() function.
      
      Recent refactoring of the get_user_pages* functions allow flags to be
      passed through get_user_pages() which eliminates the need for access to
      this function from its one user, kvm.
      
      We can see that the two calls to get_user_pages() which replace
      __get_user_pages() in kvm_main.c are equivalent by examining their call
      stacks:
      
        get_user_page_nowait():
          get_user_pages(start, 1, flags, page, NULL)
          __get_user_pages_locked(current, current->mm, start, 1, page, NULL, NULL,
      			    false, flags | FOLL_TOUCH)
          __get_user_pages(current, current->mm, start, 1,
      		     flags | FOLL_TOUCH | FOLL_GET, page, NULL, NULL)
      
        check_user_page_hwpoison():
          get_user_pages(addr, 1, flags, NULL, NULL)
          __get_user_pages_locked(current, current->mm, addr, 1, NULL, NULL, NULL,
      			    false, flags | FOLL_TOUCH)
          __get_user_pages(current, current->mm, addr, 1, flags | FOLL_TOUCH, NULL,
      		     NULL, NULL)
      Signed-off-by: default avatarLorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0d731759