1. 17 Mar, 2016 38 commits
    • mm: deduplicate memory overcommitment code · 39a1aa8e
      Andrey Ryabinin authored
      
      
      Currently we have two copies of the same code which implements memory
      overcommitment logic.  Let's move it into mm/util.c and hence avoid
      duplication.  No functional changes here.
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: move max_map_count bits into mm.h · ea606cf5
      Andrey Ryabinin authored
      
      
      The max_map_count sysctl is unrelated to the scheduler.  Move its bits from
      include/linux/sched/sysctl.h to include/linux/mm.h.
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp, vmstats: count deferred split events · f9719a03
      Kirill A. Shutemov authored
      
      
      Count how many times we put a THP on the split queue.  Currently, this
      happens on partial unmap of a THP.

      A rapidly growing value can indicate that an application behaves in a
      THP-unfriendly way: it often faults in a huge page and then unmaps part
      of it.  This leads to unnecessary memory fragmentation, and the
      application may require tuning.
      
      The event also can help with debugging kernel [mis-]behaviour.
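
A minimal sketch of how such a counter could be read from userspace. It assumes the event is exported in /proc/vmstat under the name thp_deferred_split_page (the message above does not spell out the name), and vmstat_lookup() is a hypothetical helper, not a kernel or libc API. It parses a "name value" text buffer so it can be tried without a live /proc.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Find "name value" in a /proc/vmstat-style buffer (one record per line).
 * Returns 1 and stores the value on success, 0 if the name is not present.
 */
static int vmstat_lookup(const char *text, const char *name,
                         unsigned long long *value)
{
    size_t name_len;
    const char *line = text;

    assert(text && name && value);
    name_len = strlen(name);

    while (*line) {
        /* Require the full field name followed by a space. */
        if (strncmp(line, name, name_len) == 0 && line[name_len] == ' ')
            return sscanf(line + name_len, "%llu", value) == 1;

        line = strchr(line, '\n');
        if (!line)
            break;      /* last record had no trailing newline */
        line++;         /* continue with the next record */
    }
    return 0;
}
```

Since the message ties the signal to a rapidly growing value, a monitoring tool would sample the counter periodically and look at the rate of change rather than the absolute number.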
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: workingset: make shadow node shrinker memcg aware · 0a6b76dd
      Vladimir Davydov authored
      
      
      Workingset code was recently made memcg aware, but shadow node shrinker
      is still global.  As a result, one small cgroup can consume all memory
      available for shadow nodes, possibly hurting other cgroups by reclaiming
      their shadow nodes, even though reclaim distances stored in its shadow
      nodes have no effect.  To avoid this, we need to make shadow node
      shrinker memcg aware.
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: workingset: size shadow nodes lru basing on file cache size · cdcbb72e
      Vladimir Davydov authored
      
      
      A page is activated on refault if the refault distance stored in the
      corresponding shadow entry is less than the number of active file pages.
      Since active file pages can't occupy more than half memory, we assume
      that the maximal effective refault distance can't be greater than half
      the number of present pages and size the shadow nodes lru list
      appropriately.  Generally speaking, this assumption is correct, but it
      can result in wasting a considerable chunk of memory on stale shadow
      nodes in case the portion of file pages is small, e.g.  if a workload
      mostly uses anonymous memory.
      
      To sort this out, we need to size the shadow nodes lru based not on the
      maximal possible size of the file cache, but on its current size.  We
      could take the size of the active file lru as the maximal refault
      distance, but the active lru is pretty unstable: it can shrink
      dramatically at runtime, possibly disrupting workingset detection logic.
      
      Instead we assume that the maximal refault distance equals half the
      total number of file cache pages.  This will protect us against active
      file lru size fluctuations while still being correct, because size of
      active lru is normally maintained lower than size of inactive lru.
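
The sizing change described above can be sketched as plain arithmetic. This is only an illustration under stated assumptions: ENTRIES_PER_NODE is a hypothetical packing density, both function names are invented for this example, and the real code in mm/workingset.c may scale the result differently.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical packing density: shadow entries assumed to fit in one node. */
#define ENTRIES_PER_NODE 64UL

/* Previous assumption: refault distance may reach half of all present pages. */
static unsigned long shadow_nodes_limit_old(unsigned long present_pages)
{
    return (present_pages / 2) / ENTRIES_PER_NODE;
}

/* New assumption: refault distance may reach half of the current file cache. */
static unsigned long shadow_nodes_limit_new(unsigned long active_file,
                                            unsigned long inactive_file)
{
    assert(active_file <= ULONG_MAX - inactive_file);   /* no overflow */
    return ((active_file + inactive_file) / 2) / ENTRIES_PER_NODE;
}
```

Under these assumptions, a machine with 1048576 present pages but only 65536 file pages would see the limit drop from 8192 nodes to 512, which is the kind of saving the message is after for mostly-anonymous workloads.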
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • radix-tree: account radix_tree_node to memory cgroup · 58e698af
      Vladimir Davydov authored
      
      
      Allocation of radix_tree_node objects can be easily triggered from
      userspace, so we should account them to memory cgroup.  Besides, we need
      them accounted for making shadow node shrinker per memcg (see
      mm/workingset.c).
      
      A tricky thing about accounting radix_tree_node objects is that they are
      mostly allocated through radix_tree_preload(), so we can't just set
      SLAB_ACCOUNT for radix_tree_node_cachep - that would likely result in a
      lot of unrelated cgroups using objects from each other's caches.
      
      One way to overcome this would be making radix tree preloads per memcg,
      but that would probably look cumbersome and overcomplicated.
      
      Instead, we make radix_tree_node_alloc() first try to allocate from the
      cache with __GFP_ACCOUNT, no matter if the caller has preloaded or not,
      and only if it fails fall back on using per cpu preloads.  This should
      make most allocations accounted.
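
The "accounted first, preload as fallback" order can be modelled in userspace roughly as follows. Everything here is a hypothetical stand-in: struct node, struct preload_pool, alloc_accounted() and node_alloc_sketch() are invented for illustration, and the cache_fails flag merely simulates a failed accounted allocation. It is not the kernel's radix_tree_node_alloc().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct node { bool accounted; };           /* hypothetical radix_tree_node stand-in */

struct preload_pool {                      /* hypothetical per-cpu preload stash */
    struct node *nodes[4];
    int nr;
};

/* Models an accounted cache allocation; cache_fails simulates a failure. */
static struct node *alloc_accounted(bool cache_fails)
{
    struct node *n;

    if (cache_fails)
        return NULL;
    n = calloc(1, sizeof(*n));
    if (n)
        n->accounted = true;
    return n;
}

/* Try the accounted allocation first; only then dip into the preloads. */
static struct node *node_alloc_sketch(struct preload_pool *pool, bool cache_fails)
{
    struct node *n;

    assert(pool && pool->nr >= 0 && pool->nr <= 4);
    n = alloc_accounted(cache_fails);
    if (n)
        return n;                          /* preloads stay untouched */
    if (pool->nr > 0)
        return pool->nodes[--pool->nr];    /* unaccounted fallback */
    return NULL;
}
```

The point of the ordering is that the preloaded nodes, which cannot be attributed to a cgroup, are consumed only when the accounted path fails, so most nodes end up charged to the caller's cgroup.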
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: zap memcg_kmem_online helper · b6ecd2de
      Vladimir Davydov authored
      
      
      As kmem accounting is now either enabled for all cgroups or disabled
      system-wide, there's no point in having memcg_kmem_online() helper -
      instead one can use memcg_kmem_enabled() and mem_cgroup_online(), as
      shrink_slab() now does.
      
      There are only two places left where this helper is used -
      __memcg_kmem_charge() and memcg_create_kmem_cache().  The former can
      only be called if memcg_kmem_enabled() returned true.  Since the cgroup
      it operates on is online, mem_cgroup_is_root() check will be enough.
      
      memcg_create_kmem_cache() can't use mem_cgroup_online() helper instead
      of memcg_kmem_online(), because it relies on the fact that in
      memcg_offline_kmem() memcg->kmem_state is changed before
      memcg_deactivate_kmem_caches() is called, but there we can just
      open-code the check.
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: pass root_mem_cgroup instead of NULL to memcg aware shrinker · 0fc9f58a
      Vladimir Davydov authored
      
      
      It's just convenient to implement a memcg aware shrinker when you know
      that shrink_control->memcg != NULL unless memcg_kmem_enabled() returns
      false.
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: enable kmem accounting for all cgroups in the legacy hierarchy · b313aeee
      Vladimir Davydov authored
      
      
      Workingset code was recently made memcg aware, but shadow node shrinker
      is still global.  As a result, one small cgroup can consume all memory
      available for shadow nodes, possibly hurting other cgroups by reclaiming
      their shadow nodes, even though reclaim distances stored in its shadow
      nodes have no effect.  To avoid this, we need to make shadow node
      shrinker memcg aware.
      
      The actual work is done in patch 6 of the series.  Patches 1 and 2
      prepare memcg/shrinker infrastructure for the change.  Patch 3 is just a
      collateral cleanup.  Patch 4 makes radix_tree_node accounted, which is
      necessary for making shadow node shrinker memcg aware.  Patch 5 reduces
      shadow nodes overhead in case workload mostly uses anonymous pages.
      
      This patch:
      
      Currently, in the legacy hierarchy kmem accounting is off for all
      cgroups by default and must be enabled explicitly by writing something
      to memory.kmem.limit_in_bytes.  Since we don't support reclaim on
      hitting kmem limit, nor do we have any plans to implement it, this is
      likely to be -1, just to enable kmem accounting and limit kernel memory
      consumption by the memory.limit_in_bytes along with user memory.
      
      This user API was introduced when the implementation of kmem accounting
      lacked slab shrinker support and hence was useless in practice.  Things
      have changed since then - slab shrinkers were made memcg aware, the
      accounting overhead seems to be negligible, and a failure to charge a
      kmem allocation should not have critical consequences, because we only
      account those kernel objects that should be safe to fail.  That's why
      kmem accounting is enabled by default for all cgroups in the default
      hierarchy, which will eventually replace the legacy one.
      
      The ability to enable kmem accounting for some cgroups while keeping it
      disabled for others is getting difficult to maintain.  E.g.  to make
      shadow node shrinker memcg aware (see mm/workingset.c), we need to know
      the relationship between the number of shadow nodes allocated for a
      cgroup and the size of its lru list.  If kmem accounting is enabled for
      all cgroups there is no problem, but what should we do if kmem
      accounting is enabled only for half of cgroups? We've no other choice
      but use global lru stats while scanning root cgroup's shadow nodes, but
      that would be wrong if kmem accounting was enabled for all cgroups
      (which is the case if the unified hierarchy is used), in which case we
      should use lru stats of the root cgroup's lruvec.
      
      That being said, let's enable kmem accounting for all memory cgroups by
      default.  If one finds it unstable or too costly, it can always be
      disabled system-wide by passing cgroup.memory=nokmem to the kernel at
      boot time.
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • include/linux/page-flags.h: force inlining of selected page flag modifications · 4b0f3261
      Denys Vlasenko authored
      Sometimes gcc mysteriously doesn't inline
      very small functions we expect to be inlined. See
      
          https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122
      
      With this .config:

          http://busybox.net/~vda/kernel_config_OPTIMIZE_INLINING_and_Os

      the following functions get deinlined many times.  Examples of
      disassembly:
      
      <SetPageUptodate> (43 copies, 141 calls):
             55                      push   %rbp
             48 89 e5                mov    %rsp,%rbp
             f0 80 0f 08             lock orb $0x8,(%rdi)
             5d                      pop    %rbp
             c3                      retq
      
      <PagePrivate> (10 copies, 134 calls):
             48 8b 07                mov    (%rdi),%rax
             55                      push   %rbp
             48 89 e5                mov    %rsp,%rbp
             48 c1 e8 0b             shr    $0xb,%rax
             83 e0 01                and    $0x1,%eax
             5d                      pop    %rbp
             c3                      retq
      
      This patch fixes this via s/inline/__always_inline/.
      
      Code size decrease after the patch is ~7k:
      
          text     data      bss       dec     hex filename
      92125002 20826048 36417536 149368586 8e72f0a vmlinux
      92118087 20826112 36417536 149361735 8e71447 vmlinux7_pageops_after
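
A small userspace sketch of what forcing inlining looks like. The macro force_inline is an assumption standing in for the kernel's __always_inline (renamed to avoid clashing with libc headers), and struct fake_page, FAKE_PG_UPTODATE and both helpers are hypothetical. It relies on the GCC/Clang always_inline attribute and __atomic builtins, and is not the actual page-flags.h code.

```c
#include <assert.h>

/* Comparable to the kernel's __always_inline; renamed here because libc
 * headers may define __always_inline themselves. */
#define force_inline inline __attribute__((__always_inline__))

struct fake_page { unsigned long flags; };   /* hypothetical stand-in for struct page */

#define FAKE_PG_UPTODATE 3   /* bit 3, i.e. the 0x8 mask seen in "lock orb $0x8" */

static force_inline void fake_set_uptodate(struct fake_page *page)
{
    assert(page);
    /* Atomic read-modify-write, analogous to the locked OR in the listing. */
    __atomic_fetch_or(&page->flags, 1UL << FAKE_PG_UPTODATE, __ATOMIC_SEQ_CST);
}

static force_inline int fake_test_uptodate(const struct fake_page *page)
{
    assert(page);
    return (int)((page->flags >> FAKE_PG_UPTODATE) & 1UL);
}
```

With a plain inline the compiler is free to emit an out-of-line copy per translation unit, which is where the many duplicated bodies in the disassembly come from. The attribute removes that freedom for helpers that are only a few instructions long.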
      Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Graf <tgraf@suug.ch>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • bufferhead: force inlining of buffer head flag operations · ee91ef61
      Denys Vlasenko authored
      With both gcc 4.7.2 and 4.9.2, sometimes gcc mysteriously doesn't inline
      very small functions we expect to be inlined.  See
      
          https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122
      
      With this .config:

          http://busybox.net/~vda/kernel_config_OPTIMIZE_INLINING_and_Os

      set_buffer_foo(), clear_buffer_foo() and similar functions get deinlined
      about 60 times.  Examples of disassembly:
      
      <set_buffer_mapped> (14 copies, 43 calls):
             55                      push   %rbp
             48 89 e5                mov    %rsp,%rbp
             f0 80 0f 20             lock orb $0x20,(%rdi)
             5d                      pop    %rbp
             c3                      retq
      <buffer_mapped> (3 copies, 34 calls):
             48 8b 07                mov    (%rdi),%rax
             55                      push   %rbp
             48 89 e5                mov    %rsp,%rbp
             48 c1 e8 05             shr    $0x5,%rax
             83 e0 01                and    $0x1,%eax
             5d                      pop    %rbp
             c3                      retq
      <set_buffer_new> (5 copies, 13 calls):
             55                      push   %rbp
             48 89 e5                mov    %rsp,%rbp
             f0 80 0f 40             lock orb $0x40,(%rdi)
             5d                      pop    %rbp
             c3                      retq
      
      This patch fixes this via s/inline/__always_inline/.
      This decreases vmlinux by about 3 kbytes.
      
          text     data      bss       dec     hex filename
      88200439 19905208 36421632 144527279 89d4faf vmlinux2
      88197239 19905240 36421632 144524111 89d434f vmlinux
      Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Graf <tgraf@suug.ch>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tools/vm/page-types.c: add memory cgroup dumping and filtering · 075db150
      Konstantin Khlebnikov authored
      
      
      This adds two command line keys:
      
       -c|--cgroup path|@inode	Walk only pages owned by this memory cgroup
       -C|--list-cgroup		Show memory cgroup inodes
      
      [vdavydov@virtuozzo.com: opt_cgroup should be uint64_t.  Fix conflicts with "tools/vm/page-types.c: support swap entry"]
      Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, kswapd: replace kswapd compaction with waking up kcompactd · accf6242
      Vlastimil Babka authored
      
      
      Similarly to direct reclaim/compaction, kswapd attempts to combine
      reclaim and compaction to make a page of the requested order available
      for allocation.
      
      The details differ from direct reclaim e.g. in having high watermark as
      a goal.  The code involved in kswapd's reclaim/compaction decisions has
      evolved to be quite complex.
      
      Testing reveals that it doesn't actually work in at least one scenario,
      and closer inspection suggests that it could be greatly simplified
      without compromising on the goal (make high-order page available) or
      efficiency (don't reclaim too much).  The simplification relies on
      doing all compaction in kcompactd, which is simply woken up when high
      watermarks are reached by kswapd's reclaim.
      
      The scenario where kswapd compaction doesn't work was found with mmtests
      test stress-highalloc configured to attempt order-9 allocations without
      direct reclaim, just waking up kswapd.  There was no compaction attempt
      from kswapd during the whole test.  Some added instrumentation shows
      what happens:
      
       - balance_pgdat() sets end_zone to Normal, as it's not balanced
       - reclaim is attempted on DMA zone, which sets nr_attempted to 99, but
         it cannot reclaim anything, so sc.nr_reclaimed is 0
       - for zones DMA32 and Normal, kswapd_shrink_zone uses testorder=0, so
         it merely checks if high watermarks were reached for base pages.
         This is true, so no reclaim is attempted.  For DMA, testorder=0
         wasn't used, as compaction_suitable() returned COMPACT_SKIPPED
       - even though the pgdat_needs_compaction flag wasn't set to false, no
         compaction happens due to the condition sc.nr_reclaimed >
         nr_attempted being false (as 0 < 99)
       - priority-- due to nr_reclaimed being 0, repeat until priority reaches
         0; pgdat_balanced() is false as only the small zone DMA appears
         balanced (curiously in that check, watermark appears OK and
         compaction_suitable() returns COMPACT_PARTIAL, because a lower
         classzone_idx is used there)
      
      Now, even if it was decided that reclaim shouldn't be attempted on the
      DMA zone, the scenario would be the same, as (sc.nr_reclaimed=0 >
      nr_attempted=0) is also false.  The condition really should use >= as
      the comment suggests.  Then there is a mismatch in the check for setting
      pgdat_needs_compaction to false using low watermark, while the rest uses
      high watermark, and who knows what other subtlety.  Hopefully this
      demonstrates that this is unsustainable.
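
The two comparisons discussed above can be replayed in isolation. The function names below are hypothetical and the snippet only models the comparison itself, not the surrounding balance_pgdat() logic.

```c
#include <assert.h>
#include <stdbool.h>

/* The check as described in the failing scenario: strictly greater. */
static bool reclaimed_enough_strict(unsigned long nr_reclaimed,
                                    unsigned long nr_attempted)
{
    return nr_reclaimed > nr_attempted;
}

/* The check the message says the comment intends: greater or equal. */
static bool reclaimed_enough_inclusive(unsigned long nr_reclaimed,
                                       unsigned long nr_attempted)
{
    return nr_reclaimed >= nr_attempted;
}

/* Replays the two cases from the text: (0, 99) and (0, 0). */
static void replay_cases(void)
{
    assert(!reclaimed_enough_strict(0, 99));    /* reclaim attempted on DMA */
    assert(!reclaimed_enough_strict(0, 0));     /* reclaim on DMA skipped */
    assert(reclaimed_enough_inclusive(0, 0));   /* >= would allow compaction */
}
```

With the strict form, compaction is blocked in both cases, including the one where nothing was attempted. Only the inclusive form lets the (0, 0) case through.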
      
      Luckily we can simplify this a lot.  The reclaim/compaction decisions
      make sense for direct reclaim scenario, but in kswapd, our primary goal
      is to reach high watermark in order-0 pages.  Afterwards we can attempt
      compaction just once.  Unlike direct reclaim, we don't reclaim extra
      pages (over the high watermark), the current code already disallows it
      for good reasons.
      
      After this patch, we simply wake up kcompactd to process the pgdat,
      after we have either succeeded or failed to reach the high watermarks in
      kswapd, which goes to sleep.  We pass kswapd's order and classzone_idx,
      so kcompactd can apply the same criteria to determine which zones are
      worth compacting.  Note that we use the classzone_idx from
      wakeup_kswapd(), not balanced_classzone_idx which can include higher
      zones that kswapd tried to balance too, but didn't consider them in
      pgdat_balanced().
      
      Since kswapd now cannot create high-order pages itself, we need to
      adjust how it determines the zones to be balanced.  The key element here
      is adding a "highorder" parameter to zone_balanced, which, when set to
      false, makes it consider only order-0 watermark instead of the desired
      higher order (this was done previously by kswapd_shrink_zone(), but not
      elsewhere).  This false is passed for example in pgdat_balanced().
      Importantly, wakeup_kswapd() uses true to make sure kswapd and thus
      kcompactd are woken up for a high-order allocation failure.
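
The effect of the "highorder" parameter can be illustrated with a toy model. struct fake_zone, high_wmark_ok() and zone_balanced_sketch() are hypothetical simplifications made up for this sketch. The kernel's watermark check is considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>

struct fake_zone {                    /* hypothetical, heavily simplified zone */
    unsigned long free_pages;         /* free base pages */
    unsigned long high_wmark;         /* high watermark, in base pages */
    int largest_free_order;           /* highest order with a free block */
};

/* Toy watermark check: enough free pages, plus a free block of the order. */
static bool high_wmark_ok(const struct fake_zone *zone, int order)
{
    if (zone->free_pages < zone->high_wmark)
        return false;
    return order == 0 || zone->largest_free_order >= order;
}

/* With highorder == false, only the order-0 watermark is considered. */
static bool zone_balanced_sketch(const struct fake_zone *zone, int order,
                                 bool highorder)
{
    assert(zone && order >= 0);
    return high_wmark_ok(zone, highorder ? order : 0);
}
```

In this model a fragmented zone with plenty of free base pages counts as balanced when highorder is false, so kswapd can stop reclaiming, while the same zone is reported as unbalanced for an order-9 request when highorder is true, which is what justifies the wakeup.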
      
      The last thing is to decide what to do with pageblock_skip bitmap
      handling.  Compaction maintains a pageblock_skip bitmap to record
      pageblocks where isolation recently failed.  This bitmap can be reset in
      three ways:
      
      1) direct compaction is restarting after going through the full deferred cycle
      
      2) kswapd goes to sleep, and some other direct compaction has previously
         finished scanning the whole zone and set zone->compact_blockskip_flush.
         Note that a successful direct compaction clears this flag.
      
      3) compaction was invoked manually via trigger in /proc
      
      The case 2) is somewhat fuzzy to begin with, but after introducing
      kcompactd we should update it.  The check for direct compaction in 1),
      and to set the flush flag in 2) use current_is_kswapd(), which doesn't
      work for kcompactd.  Thus, this patch adds bool direct_compaction to
      compact_control to use in 2).  For the case 1) we remove the check
      completely - unlike the former kswapd compaction, kcompactd does use the
      deferred compaction functionality, so flushing tied to restarting from
      deferred compaction makes sense here.
      
      Note that when kswapd goes to sleep, kcompactd is woken up, so it will
      see the flushed pageblock_skip bits.  This is different from when the
      former kswapd compaction observed the bits and I believe it makes more
      sense.  Kcompactd can afford to be more thorough than a direct
      compaction trying to limit allocation latency, or kswapd whose primary
      goal is to reclaim.
      
      For testing, I used stress-highalloc configured to do order-9
      allocations with GFP_NOWAIT|__GFP_HIGH|__GFP_COMP, so they relied just
      on kswapd/kcompactd reclaim/compaction (the interfering kernel builds in
      phases 1 and 2 work as usual):
      
      stress-highalloc
                              4.5-rc1+before          4.5-rc1+after
                                   -nodirect              -nodirect
      Success 1 Min          1.00 (  0.00%)         5.00 (-66.67%)
      Success 1 Mean         1.40 (  0.00%)         6.20 (-55.00%)
      Success 1 Max          2.00 (  0.00%)         7.00 (-16.67%)
      Success 2 Min          1.00 (  0.00%)         5.00 (-66.67%)
      Success 2 Mean         1.80 (  0.00%)         6.40 (-52.38%)
      Success 2 Max          3.00 (  0.00%)         7.00 (-16.67%)
      Success 3 Min         34.00 (  0.00%)        62.00 (  1.59%)
      Success 3 Mean        41.80 (  0.00%)        63.80 (  1.24%)
      Success 3 Max         53.00 (  0.00%)        65.00 (  2.99%)
      
      User                          3166.67        3181.09
      System                        1153.37        1158.25
      Elapsed                       1768.53        1799.37
      
                                  4.5-rc1+before   4.5-rc1+after
                                       -nodirect    -nodirect
      Direct pages scanned                32938        32797
      Kswapd pages scanned              2183166      2202613
      Kswapd pages reclaimed            2152359      2143524
      Direct pages reclaimed              32735        32545
      Percentage direct scans                1%           1%
      THP fault alloc                       579          612
      THP collapse alloc                    304          316
      THP splits                              0            0
      THP fault fallback                    793          778
      THP collapse fail                      11           16
      Compaction stalls                    1013         1007
      Compaction success                     92           67
      Compaction failures                   920          939
      Page migrate success               238457       721374
      Page migrate failure                23021        23469
      Compaction pages isolated          504695      1479924
      Compaction migrate scanned         661390      8812554
      Compaction free scanned          13476658     84327916
      Compaction cost                       262          838
      
      After this patch we see improvements in allocation success rate
      (especially for phase 3) along with increased compaction activity.  The
      compaction stalls (direct compaction) in the interfering kernel builds
      (probably THP's) also decreased somewhat thanks to kcompactd activity,
      yet THP alloc successes improved a bit.
      
      Note that elapsed and user time aren't so useful for this benchmark,
      because the background interference is unpredictable.  They are just
      there to quickly spot major unexpected differences.  System time is
      somewhat more useful, and that didn't increase.
      
      Also (after adjusting mmtests' ftrace monitor):
      
      Time kswapd awake               2547781     2269241
      Time kcompactd awake                  0      119253
      Time direct compacting           939937      557649
      Time kswapd compacting                0           0
      Time kcompactd compacting             0      119099
      
      The decrease in overall time spent compacting appears not to match the
      increased compaction stats.  I suspect the tasks get rescheduled and
      since the ftrace monitor doesn't see that, the reported time is wall
      time, not CPU time.  But arguably direct compactors care about overall
      latency anyway, whether busy compacting or waiting for CPU doesn't
      matter.  And that latency seems to have almost halved.
      
      It's also interesting how much time kswapd spent awake just going
      through all the priorities and failing to even try compacting, over and
      over.
      
      We can also configure stress-highalloc to perform both direct
      reclaim/compaction and wakeup kswapd/kcompactd, by using
      GFP_KERNEL|__GFP_HIGH|__GFP_COMP:
      
      stress-highalloc
                              4.5-rc1+before         4.5-rc1+after
                                     -direct               -direct
      Success 1 Min          4.00 (  0.00%)        9.00 (-50.00%)
      Success 1 Mean         8.00 (  0.00%)       10.00 (-19.05%)
      Success 1 Max         12.00 (  0.00%)       11.00 ( 15.38%)
      Success 2 Min          4.00 (  0.00%)        9.00 (-50.00%)
      Success 2 Mean         8.20 (  0.00%)       10.00 (-16.28%)
      Success 2 Max         13.00 (  0.00%)       11.00 (  8.33%)
      Success 3 Min         75.00 (  0.00%)       74.00 (  1.33%)
      Success 3 Mean        75.60 (  0.00%)       75.20 (  0.53%)
      Success 3 Max         77.00 (  0.00%)       76.00 (  0.00%)
      
      User                          3344.73       3246.04
      System                        1194.24       1172.29
      Elapsed                       1838.04       1836.76
      
                                  4.5-rc1+before  4.5-rc1+after
                                         -direct     -direct
      Direct pages scanned               125146      120966
      Kswapd pages scanned              2119757     2135012
      Kswapd pages reclaimed            2073183     2108388
      Direct pages reclaimed             124909      120577
      Percentage direct scans                5%          5%
      THP fault alloc                       599         652
      THP collapse alloc                    323         354
      THP splits                              0           0
      THP fault fallback                    806         793
      THP collapse fail                      17          16
      Compaction stalls                    2457        2025
      Compaction success                    906         518
      Compaction failures                  1551        1507
      Page migrate success              2031423     2360608
      Page migrate failure                32845       40852
      Compaction pages isolated         4129761     4802025
      Compaction migrate scanned       11996712    21750613
      Compaction free scanned         214970969   344372001
      Compaction cost                      2271        2694
      
      In this scenario, this patch doesn't change the overall success rate as
      direct compaction already tries all it can.  There's however significant
      reduction in direct compaction stalls (that is, the number of
      allocations that went into direct compaction).  The number of successes
      (i.e.  direct compaction stalls that ended up with successful
      allocation) is reduced by the same number.  This means the offload to
      kcompactd is working as expected, and direct compaction is reduced
      either due to detecting contention, or compaction deferred by kcompactd.
      In the previous version of this patchset there was some apparent
      reduction of success rate, but the changes in this version (such as
      using sync compaction only), new baseline kernel, and/or averaging
      results from 5 executions (my bet), made this go away.
      
      Ftrace-based stats seem to roughly agree:
      
      Time kswapd awake               2532984     2326824
      Time kcompactd awake                  0      257916
      Time direct compacting           864839      735130
      Time kswapd compacting                0           0
      Time kcompactd compacting             0      257585
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memory hotplug: small cleanup in online_pages() · e888ca35
      Vlastimil Babka authored
      
      
      We can reuse the nid we've determined instead of repeated pfn_to_nid()
      usages.  Also zone_to_nid() should be a bit cheaper in general than
      pfn_to_nid().
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e888ca35
    • Vlastimil Babka's avatar
      mm, compaction: introduce kcompactd · 698b1b30
      Vlastimil Babka authored
      
      
      Memory compaction can be currently performed in several contexts:
      
       - kswapd balancing a zone after a high-order allocation failure
       - direct compaction to satisfy a high-order allocation, including THP
         page fault attempts
       - khugepaged trying to collapse a hugepage
       - manually from /proc
      
      The purpose of compaction is two-fold.  The obvious purpose is to
      satisfy a (pending or future) high-order allocation, and is easy to
      evaluate.  The other purpose is to keep overall memory fragmentation low
      and help the anti-fragmentation mechanism.  The success wrt the latter
      purpose is more difficult to evaluate.
      
      The current situation wrt the purposes has a few drawbacks:
      
       - compaction is invoked only when a high-order page or hugepage is not
         available (or manually).  This might be too late for the purposes of
         keeping memory fragmentation low.
       - direct compaction increases latency of allocations.  Again, it would
         be better if compaction was performed asynchronously to keep
         fragmentation low, before the allocation itself comes.
       - (a special case of the previous) the cost of compaction during THP
         page faults can easily offset the benefits of THP.
       - kswapd compaction appears to be complex, fragile and not working in
         some scenarios.  It could also end up compacting for a high-order
         allocation request when it should be reclaiming memory for a later
         order-0 request.
      
      To improve the situation, we should be able to benefit from an
      equivalent of kswapd, but for compaction - i.e. a background thread
      which responds to fragmentation and the need for high-order allocations
      (including hugepages) somewhat proactively.
      
      One possibility is to extend the responsibilities of kswapd, which could
      however complicate its design too much.  It should be better to let
      kswapd handle reclaim, as order-0 allocations are often more critical
      than high-order ones.
      
      Another possibility is to extend khugepaged, but this kthread is a
      single instance and tied to THP configs.
      
      This patch goes with the option of a new set of per-node kthreads called
      kcompactd, and lays the foundations, without introducing any new
      tunables.  The lifecycle mimics kswapd kthreads, including the memory
      hotplug hooks.
      
      For compaction, kcompactd uses the standard compaction_suitable() and
      compact_finished() criteria and the deferred compaction functionality.
      Unlike direct compaction, it uses only sync compaction, as there's no
      allocation latency to minimize.
      
      This patch doesn't yet add a call to wakeup_kcompactd.  The kswapd
      compact/reclaim loop for high-order pages will be replaced by waking up
      kcompactd in the next patch with the description of what's wrong with
      the old approach.
      
      Waking up of the kcompactd threads is also tied to kswapd activity and
      follows these rules:
       - we don't want to affect any fastpaths, so wake up kcompactd only from
         the slowpath, as it's done for kswapd
       - if kswapd is doing reclaim, it's more important than compaction, so
         don't invoke kcompactd until kswapd goes to sleep
       - the target order used for kswapd is passed to kcompactd
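
The wake-up rules above can be read as the following toy model.  It is only a sketch under stated assumptions: the class and method names (`Node`, `alloc_slowpath`, `kswapd_goes_to_sleep`) are hypothetical and do not mirror the kernel's data structures, and skipping the wake-up for order 0 is an assumption made for the illustration.

```python
class Node:
    """Toy model of one NUMA node with a kswapd and a kcompactd thread."""

    def __init__(self):
        self.kswapd_active = False
        self.kswapd_order = 0        # highest order kswapd was woken for
        self.kcompactd_wakeups = []  # orders handed over to kcompactd

    def alloc_fastpath(self, order):
        # Rule 1: the fast path never wakes a background thread.
        return None

    def alloc_slowpath(self, order):
        # The slow path wakes kswapd and records the highest order requested.
        self.kswapd_active = True
        self.kswapd_order = max(self.kswapd_order, order)

    def kswapd_goes_to_sleep(self):
        # Rule 2: kcompactd is woken only once kswapd is done reclaiming.
        # Rule 3: kswapd's target order is passed on to kcompactd.
        self.kswapd_active = False
        if self.kswapd_order > 0:  # assumption: order 0 needs no compaction
            self.kcompactd_wakeups.append(self.kswapd_order)
        self.kswapd_order = 0


node = Node()
node.alloc_fastpath(3)        # no wake-up at all
node.alloc_slowpath(3)        # wakes kswapd only
node.kswapd_goes_to_sleep()   # now kcompactd is woken with order 3
print(node.kcompactd_wakeups)
```

In this model kcompactd never runs concurrently with kswapd reclaim, which is the priority ordering the rules describe.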
      
      Possible future uses for kcompactd include the ability to wake up
      kcompactd on demand in special situations, such as when hugepages are
      not available (currently not done due to __GFP_NO_KSWAPD) or when a
      fragmentation event (i.e.  __rmqueue_fallback()) occurs.  It's also
      possible to perform periodic compaction with kcompactd.
      
      [arnd@arndb.de: fix build errors with kcompactd]
      [paul.gortmaker@windriver.com: don't use modular references for non modular code]
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      698b1b30
    • Vlastimil Babka's avatar
      mm, kswapd: remove bogus check of balance_classzone_idx · 81c5857b
      Vlastimil Babka authored
      During work on kcompactd integration I have spotted a confusing check of
      balanced_classzone_idx, which I believe is bogus.
      
      The balanced_classzone_idx is filled by balance_pgdat() as the highest
      zone it attempted to balance.  This was introduced by commit dc83edd9
      ("mm: kswapd: use the classzone idx that kswapd was using for
      sleeping_prematurely()").
      
      The intention is that (as expressed in today's function names), the
      value used for kswapd_shrink_zone() calls in balance_pgdat() is the same
      as for the decisions in kswapd_try_to_sleep().
      
      An unwanted side-effect of that commit was breaking the checks in
      kswapd() whether there was another kswapd_wakeup with a tighter (=lower)
      classzone_idx.  Commits 215ddd66 ("mm: vmscan: only read
      new_classzone_idx from pgdat when reclaiming successfully") and
      d2ebd0f6 ("kswapd: avoid unnecessary rebalance after an unsuccessful
      balancing") tried to fix that, but apparently introduced a bogus check
      that this patch removes.
      
      Consider zone indexes X < Y < Z, where:
      - Z is the value used for the first kswapd wakeup.
      - Y is returned as balanced_classzone_idx, which means zones with index higher
        than Y (including Z) were found to be unreclaimable.
      - X is the value used for the second kswapd wakeup
      
      The new wakeup with value X means that kswapd is now supposed to balance
      harder all zones with index <= X.  But instead, due to Y < Z, it will go
      to sleep and won't read the new value X.  This is subtly wrong.
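
The X < Y < Z scenario can be replayed with a simplified model.  This is a sketch reconstructed from the description above, not the kernel code: it ignores the allocation order entirely, and the function and parameter names are hypothetical.

```python
def kswapd_decision(first_idx, balanced_idx, pending_idx, bogus_check):
    """Return "rebalance" or "sleep" after one balancing pass.

    first_idx    -- classzone_idx of the wakeup kswapd acted on (Z)
    balanced_idx -- highest zone index that was actually balanced (Y)
    pending_idx  -- classzone_idx of a wakeup that arrived meanwhile (X)
    """
    new_idx = first_idx
    # With the bogus check, the pending wakeup is read only if Y >= Z.
    if not bogus_check or balanced_idx >= new_idx:
        new_idx = pending_idx
    # A tighter (lower) index means there is more balancing to do.
    return "rebalance" if new_idx < first_idx else "sleep"


X, Y, Z = 0, 1, 2
print(kswapd_decision(Z, Y, X, bogus_check=True))   # sleep
print(kswapd_decision(Z, Y, X, bogus_check=False))  # rebalance
```

With the check in place the pending value X is never looked at because Y < Z; without it, the tighter request is noticed and kswapd keeps working.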
      
      The effect of this patch is that kswapd will react better in some
      situations, e.g. where the first wakeup is for ZONE_DMA32, the second is
      for ZONE_DMA, and ZONE_NORMAL is unreclaimable.  Before this patch,
      kswapd would go to sleep instead of reclaiming ZONE_DMA harder.  I expect
      these situations are very rare, and more value is in better
      maintainability due to the removal of a confusing and bogus check.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      81c5857b
    • Joonsoo Kim's avatar
      tile: query dynamic DEBUG_PAGEALLOC setting · 21c64786
      Joonsoo Kim authored
      
      
      We can disable debug_pagealloc processing even if the code is compiled
      with CONFIG_DEBUG_PAGEALLOC.  This patch changes the code to query
      whether it is enabled or not at runtime.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Acked-by: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Takashi Iwai <tiwai@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      21c64786
    • Joonsoo Kim's avatar
      powerpc: query dynamic DEBUG_PAGEALLOC setting · e7df0d88
      Joonsoo Kim authored
      
      
      We can disable debug_pagealloc processing even if the code is compiled
      with CONFIG_DEBUG_PAGEALLOC.  This patch changes the code to query
      whether it is enabled or not at runtime.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e7df0d88
    • Joonsoo Kim's avatar
      sound: query dynamic DEBUG_PAGEALLOC setting · 505f6d22
      Joonsoo Kim authored
      
      
      We can disable debug_pagealloc processing even if the code is compiled
      with CONFIG_DEBUG_PAGEALLOC.  This patch changes the code to query
      whether it is enabled or not at runtime.
      
      [akpm@linux-foundation.org: export _debug_pagealloc_enabled to modules]
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Takashi Iwai <tiwai@suse.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      505f6d22
    • Joonsoo Kim's avatar
      mm/slub: query dynamic DEBUG_PAGEALLOC setting · 922d566c
      Joonsoo Kim authored
      
      
      We can disable debug_pagealloc processing even if the code is compiled
      with CONFIG_DEBUG_PAGEALLOC.  This patch changes the code to query
      whether it is enabled or not at runtime.
      
      [akpm@linux-foundation.org: clean up code, per Christian]
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Takashi Iwai <tiwai@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      922d566c
    • Joonsoo Kim's avatar
      mm/vmalloc: query dynamic DEBUG_PAGEALLOC setting · f48d97f3
      Joonsoo Kim authored
      As CONFIG_DEBUG_PAGEALLOC can be enabled/disabled via kernel parameters
      we can optimize some cases by checking the enablement state.
      
      This is follow-up work for Christian's Optimize CONFIG_DEBUG_PAGEALLOC:
      
        https://lkml.org/lkml/2016/1/27/194
      
      
      
      The remaining work is to make sparc aware of this, but that does not look
      easy to me, so I skip it in this series.
      
      This patch (of 5):
      
      We can disable debug_pagealloc processing even if the code is compiled
      with CONFIG_DEBUG_PAGEALLOC.  This patch changes the code to query
      whether it is enabled or not at runtime.
      
      [akpm@linux-foundation.org: update comment, per David.  Adjust comment to use 80 cols]
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Takashi Iwai <tiwai@suse.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f48d97f3
    • Naoya Horiguchi's avatar
      tools/vm/page-types.c: support swap entry · 0335ddd3
      Naoya Horiguchi authored
      
      
      /proc/pid/pagemap (pte_to_pagemap_entry() internally) already reports
      about swap entry, so let's make the in-kernel utility aware of it.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0335ddd3
    • Naoya Horiguchi's avatar
      /proc/kpageflags: return KPF_SLAB for slab tail pages · 0a71649c
      Naoya Horiguchi authored
      
      
      Currently /proc/kpageflags returns just KPF_COMPOUND_TAIL for slab tail
      pages, which is inconvenient when grasping how slab pages are
      distributed (userspace always needs to check which kind of tail pages by
      itself).  This patch sets KPF_SLAB for such pages.
      
      With this patch:
      
        $ grep Slab /proc/meminfo ; tools/vm/page-types -b slab
        Slab:              64880 kB
                     flags      page-count       MB  symbolic-flags                     long-symbolic-flags
        0x0000000000000080           16220       63  _______S__________________________________ slab
                     total           16220       63
      
      16220 pages equal 64880 kB, so the returned result is consistent with
      the global counter.
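
The consistency check is a plain page-count to kB conversion.  A minimal sketch, assuming 4 kB pages (the page size is not stated in the message but is what the figures imply); the helper name is hypothetical.

```python
PAGE_SIZE_KB = 4  # assumption: 4 kB pages, as implied by the figures above


def pages_to_kb(page_count, page_size_kb=PAGE_SIZE_KB):
    """Convert a page count reported by page-types into kB."""
    return page_count * page_size_kb


slab_kb = pages_to_kb(16220)
print(slab_kb)           # 64880
print(slab_kb == 64880)  # matches the "Slab:" line of /proc/meminfo
```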
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0a71649c
    • Naoya Horiguchi's avatar
      /proc/kpageflags: return KPF_BUDDY for "tail" buddy pages · 832fc1de
      Naoya Horiguchi authored
      
      
      Currently /proc/kpageflags returns nothing for "tail" buddy pages, which
      is inconvenient when grasping how free pages are distributed.  This
      patch sets KPF_BUDDY for such pages.
      
      With this patch:
      
        $ grep MemFree /proc/meminfo ; tools/vm/page-types -b buddy
        MemFree:         3134992 kB
                     flags      page-count       MB  symbolic-flags                     long-symbolic-flags
        0x0000000000000400          779272     3044  __________B_______________________________ buddy
        0x0000000000000c00            4385       17  __________BM______________________________ buddy,mmap
                     total          783657     3061
      
      783657 pages is 3134628 kB (roughly consistent with the global counter),
      so it's OK.
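
The flag values in the listing can be decoded bit by bit.  The sketch below uses only the three bits that appear in the outputs quoted in these messages (bit 7 slab, bit 10 buddy, bit 11 mmap) and again assumes 4 kB pages; the helper name is hypothetical.

```python
# Subset of the bit numbers used by /proc/kpageflags.
KPF_BITS = {7: "slab", 10: "buddy", 11: "mmap"}

PAGE_SIZE_KB = 4  # assumption: 4 kB pages


def decode_flags(flags):
    """Return the names of the bits set in a kpageflags value."""
    names = []
    bit = 0
    while flags >> bit:
        if flags & (1 << bit):
            names.append(KPF_BITS.get(bit, "bit%d" % bit))
        bit += 1
    return names


buddy_kb = (779272 + 4385) * PAGE_SIZE_KB
print(decode_flags(0x400), decode_flags(0xC00))
print(buddy_kb, 3134992 - buddy_kb)  # buddy total and gap to MemFree, in kB
```

The remaining gap to MemFree is 364 kB, which is the "roughly consistent" part of the statement above.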
      
      [akpm@linux-foundation.org: update comment, per Naoya]
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      832fc1de
    • Vladimir Davydov's avatar
      mm: memcontrol: report kernel stack usage in cgroup2 memory.stat · 12580e4b
      Vladimir Davydov authored
      
      
      Show how much memory is allocated to kernel stacks.
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      12580e4b
    • Vladimir Davydov's avatar
      mm: memcontrol: report slab usage in cgroup2 memory.stat · 27ee57c9
      Vladimir Davydov authored
      
      
      Show how much memory is used for storing reclaimable and unreclaimable
      in-kernel data structures allocated from slab caches.
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      27ee57c9
    • Vladimir Davydov's avatar
      mm: memcontrol: make tree_{stat,events} fetch all stats · 72b54e73
      Vladimir Davydov authored
      
      
      Currently, tree_{stat,events} helpers can only get one stat index at a
      time, so when there are a lot of stats to be reported one has to call it
      over and over again (see memory_stat_show).  This is neither efficient,
      nor does it look good.  Instead, let's make these helpers take a
      snapshot of all available counters.
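
The difference between the two styles can be sketched with a toy cgroup tree.  This is an illustration only: the `Cgroup` class and both helper names are hypothetical and the counters are made-up keys, not the kernel's stat indices.

```python
class Cgroup:
    """Toy cgroup holding its own counters and a list of children."""

    def __init__(self, stats, children=()):
        self.stats = stats
        self.children = list(children)


def walk(cg):
    """Yield cg and every descendant."""
    yield cg
    for child in cg.children:
        yield from walk(child)


def tree_stat_single(root, key):
    """Old style: one full tree walk per counter."""
    return sum(cg.stats.get(key, 0) for cg in walk(root))


def tree_stat_all(root, keys):
    """New style: a single tree walk takes a snapshot of every counter."""
    totals = dict.fromkeys(keys, 0)
    for cg in walk(root):
        for key in keys:
            totals[key] += cg.stats.get(key, 0)
    return totals


root = Cgroup({"anon": 1, "file": 2},
              [Cgroup({"anon": 10}), Cgroup({"file": 20, "slab": 5})])
print(tree_stat_all(root, ["anon", "file", "slab"]))
```

Reporting N counters costs N tree walks with the first helper and one walk with the second, while the totals are the same.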
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      72b54e73
    • Vladimir Davydov's avatar
      mm: memcontrol: do not bypass slab charge if memcg is offline · fcff7d7e
      Vladimir Davydov authored
      
      
      Slab pages are charged in two steps.  First, an appropriate per memcg
      cache is selected (see memcg_kmem_get_cache) based on the current
      context, then the new slab page is charged to the memory cgroup which
      the selected cache was created for (see memcg_charge_slab ->
      __memcg_kmem_charge_memcg).  It is OK to bypass kmemcg charge at step 1,
      but if step 1 succeeded and we successfully allocated a new slab page,
      step 2 must be performed, otherwise we would get a per memcg kmem cache
      which contains a slab that does not hold a reference to the memory
      cgroup owning the cache.  Since per memcg kmem caches are destroyed on
      memcg css free, this could result in freeing a cache while there are
      still active objects in it.
      
      However, currently we will bypass slab page charge if the memory cgroup
      owning the cache is offline (see __memcg_kmem_charge_memcg).  This is
      very unlikely to occur in practice, because for this to happen a process
      must be migrated to a different cgroup and the old cgroup must be
      removed while the process is in kmalloc somewhere between steps 1 and 2
      (e.g.  trying to allocate a new page).  Nevertheless, it's still better
      to eliminate such a possibility.
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fcff7d7e
    • Johannes Weiner's avatar
      mm: oom_kill: don't ignore oom score on exiting tasks · 6a618957
      Johannes Weiner authored
      
      
      When the OOM killer scans tasks and encounters a PF_EXITING one, it
      force-selects that task regardless of the score.  The problem is that if
      that task got stuck waiting for some state the allocation site is
      holding, the OOM reaper can not move on to the next best victim.
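
The effect of the force-selection can be shown with a toy victim picker.  This is a sketch only: tasks are plain dictionaries, and the `force_exiting` switch is a hypothetical stand-in for the PF_EXITING shortcut described above.

```python
def select_victim(tasks, force_exiting):
    """Pick an OOM victim from a list of task dicts."""
    best = None
    for task in tasks:
        if force_exiting and task["exiting"]:
            # Old behaviour: an exiting task wins regardless of its score.
            return task
        if best is None or task["score"] > best["score"]:
            best = task
    return best


tasks = [
    {"name": "hog", "score": 900, "exiting": False},
    {"name": "stuck", "score": 10, "exiting": True},
]
print(select_victim(tasks, force_exiting=True)["name"])   # stuck
print(select_victim(tasks, force_exiting=False)["name"])  # hog
```

If the exiting task is blocked and never frees its memory, the first variant keeps returning it and the high-scoring task is never considered.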
      
      Frankly, I don't even know why we check for exiting tasks in the OOM
      killer.  We've tried direct reclaim at least 15 times by the time we
      decide the system is OOM, there was plenty of time to exit and free
      memory; and a task might exit voluntarily right after we issue a kill.
      This is testing pure noise.  Remove it.
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <andrea@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6a618957
    • Joshua Hunt's avatar
      watchdog: don't run proc_watchdog_update if new value is same as old · a1ee1932
      Joshua Hunt authored
      
      
      While working on a script to restore all sysctl params before a series of
      tests I found that writing any value into
      /proc/sys/kernel/{nmi_watchdog,soft_watchdog,watchdog,watchdog_thresh}
      causes a call to proc_watchdog_update(), even if the value is unchanged:
      
        NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
        NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
        NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
        NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
      
      There doesn't appear to be a reason for doing this work every time a write
      occurs, so only do it when the values change.
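
The intended behaviour can be sketched as a write handler that compares old and new values.  This is a toy model, not the kernel's sysctl handler: the class and its `updates` counter are hypothetical, and only the name proc_watchdog_update comes from the message above.

```python
class WatchdogSysctl:
    """Toy model of one watchdog sysctl knob."""

    def __init__(self, value=1):
        self.value = value
        self.updates = 0  # how many times the expensive update ran

    def _proc_watchdog_update(self):
        # Stand-in for the real reconfiguration work.
        self.updates += 1

    def write(self, new_value):
        old_value = self.value
        self.value = new_value
        if new_value == old_value:
            return  # nothing changed: skip the update
        self._proc_watchdog_update()


knob = WatchdogSysctl(value=1)
for _ in range(4):
    knob.write(1)      # restoring the same value is now a no-op
knob.write(0)          # a real change still triggers the update
print(knob.updates)    # 1
```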
      Signed-off-by: Josh Hunt <johunt@akamai.com>
      Acked-by: Don Zickus <dzickus@redhat.com>
      Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: <stable@vger.kernel.org>	[4.1.x+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-o...
      a1ee1932
    • Aaro Koskinen's avatar
      drivers/firmware/broadcom/bcm47xx_nvram.c: fix incorrect __ioread32_copy · 4c11e554
      Aaro Koskinen authored
      Commit 1f330c32 ("drivers/firmware/broadcom/bcm47xx_nvram.c: use
      __ioread32_copy() instead of open-coding") switched to use a generic
      copy function, but failed to notice that the header pointer is updated
      between the two copies, resulting in bogus data being copied in the
      latter one.  Fix by keeping the old header pointer.
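
The class of bug described here can be illustrated with a small model.  This is a sketch under stated assumptions, not the driver code: the flash contents, the 4-byte header size and the function name are all hypothetical, and Python byte offsets stand in for pointers.

```python
HEADER_SIZE = 4  # hypothetical header length in bytes


def copy_nvram(flash, header_off, total_len, keep_old_pointer):
    """Copy header plus payload from flash into a fresh buffer."""
    buf = bytearray(total_len)
    # First copy: the header, from flash into the buffer.
    buf[:HEADER_SIZE] = flash[header_off:header_off + HEADER_SIZE]
    src, src_off = flash, header_off  # the original header pointer
    if not keep_old_pointer:
        # Bug: the header pointer is redirected to the destination buffer.
        src, src_off = buf, 0
    # Second copy: the data that follows the header.
    start = src_off + HEADER_SIZE
    buf[HEADER_SIZE:] = src[start:start + total_len - HEADER_SIZE]
    return bytes(buf)


flash = b"xxHEADpayload"
print(copy_nvram(flash, 2, 11, keep_old_pointer=True))
print(copy_nvram(flash, 2, 11, keep_old_pointer=False))
```

With the old pointer kept, the payload comes from flash; with the redirected pointer, the second copy reads from the still-empty destination buffer, so the payload is lost.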
      
      The patch fixes totally broken networking on WRT54GL router (both LAN and
      WLAN interfaces fail to probe).
      
      Fixes: 1f330c32 ("drivers/firmware/broadcom/bcm47xx_nvram.c: use __ioread32_copy() instead of open-coding")
      Signed-off-by: Aaro Koskinen <aaro.koskinen@iki.fi>
      Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
      Cc: Rafal Milecki <zajec5@gmail.com>
      Cc: Hauke Mehrtens <hauke@hauke-m.de>
      Cc: <stable@vger.kernel.org>	[4.4.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4c11e554
    • Luis R. Rodriguez's avatar
      ia64: define ioremap_uc() · b0f84ac3
      Luis R. Rodriguez authored
      
      
      All architectures now need ioremap_uc().  ia64 seems to define this
      already through its ioremap_nocache(), which already ensures it *only*
      uses UC.
      
      This has been needed since v4.3 to complete an allyesconfig compile on
      ia64.  Other archs needed this as well, and this one seems to have
      fallen through the cracks.
      Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
      Reported-by: kbuild test robot <fengguang.wu@intel.com>
      Acked-by: Tony Luck <tony.luck@intel.com>
      Cc: <stable@vger.kernel.org>	[4.3+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b0f84ac3
    • Linus Torvalds's avatar
      Merge tag 'fbdev-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux · 09fd671c
      Linus Torvalds authored
      Pull fbdev updates from Tomi Valkeinen:
      
       - Miscellaneous small fixes to various fbdev drivers
      
       - Remove fb_rotate, which was never used
      
       - pmag fb improvements
      
      * tag 'fbdev-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux: (21 commits)
        xen kconfig: don't "select INPUT_XEN_KBDDEV_FRONTEND"
        video: fbdev: sis: remove unused variable
        drivers/video: make fbdev/sunxvr2500.c explicitly non-modular
        drivers/video: make fbdev/sunxvr1000.c explicitly non-modular
        drivers/video: make fbdev/sunxvr500.c explicitly non-modular
        video: exynos: fix modular build
        fbdev: da8xx-fb: fix videomodes of lcd panels
        fbdev: kill fb_rotate
        video: fbdev: bt431: Correct cursor format control macro
        video: fbdev: pmag-ba-fb: Optimize Bt455 colormap addressing
        video: fbdev: pmag-ba-fb: Fix and rework Bt455 colormap handling
        video: fbdev: bt455: Remove unneeded colormap helpers for cursor support
        video: fbdev: pmag-aa-fb: Report video timings
        video: fbdev: pmag-aa-fb: Enable building as a module
        video: fbdev: pmag-aa-fb: Adapt to current APIs
        video: fbdev: pmag-ba-fb: Fix the lower margin size
        fbdev: sh_mobile_lcdc: Use ARCH_RENESAS
        fbdev: n411: check return value
        fbdev: exynos: fix IS_ERR_VALUE usage
        video: Use bool instead int pointer for get_opt_bool() argument
        ...
      09fd671c
    • Linus Torvalds's avatar
      Merge tag 'media/v4.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media · bace3db5
      Linus Torvalds authored
      Pull media updates from Mauro Carvalho Chehab:
       - Added support for some new video formats
       - mn88473 DVB frontend driver got promoted from staging
       - several improvements at the VSP1 driver
       - several cleanups and improvements at the Media Controller
       - added Media Controller support to snd-usb-audio.  Currently, enabled
         only for au0828-based V4L2/DVB boards
       - Several improvements at nuvoton-cir: it now supports wake up codes
       - Add media controller support to em28xx and saa7134 drivers
       - coda driver now accepts NXP distributed firmware files
       - Some legacy SoC camera drivers will be moving to staging, as they're
         outdated and nobody so far is willing to fix and convert them to use
         the current media framework
       - As usual, lots of cleanups, improvements and new board additions.
      
      * tag 'media/v4.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (381 commits)
        media: au0828 disable tuner to demod link in au0828_media_device_register()
        [media] touptek: cast char types on %x printk
        [media] touptek: don't DMA at the stack
        [media] mceusb: use %*ph for small buffer dumps
        [media] v4l: exynos4-is: Drop unneeded check when setting up fimc-lite links
        [media] v4l: vsp1: Check if an entity is a subdev with the right function
        [media] hide unused functions for !MEDIA_CONTROLLER
        [media] em28xx: fix Terratec Grabby AC97 codec detection
        [media] media: add prefixes to interface types
        [media] media: rc: nuvoton: switch attribute wakeup_data to text
        [media] v4l2-ioctl: fix YUV422P pixel format description
        [media] media: fix null pointer dereference in v4l_vb2q_enable_media_source()
        [media] v4l2-mc.h: fix yet more compiler errors
        [media] staging/media: add missing TODO files
        [media] media.h: always start with 1 for the audio entities
        [media] sound/usb: Use meaninful names for goto labels
        [media] v4l2-mc.h: fix compiler warnings
        [media] media: au0828 audio mixer isn't connected to decoder
        [media] sound/usb: Use Media Controller API to share media resources
        [media] dw2102: add support for TeVii S662
        ...
      bace3db5
    • Linus Torvalds's avatar
      Merge tag 'libnvdimm-for-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm · 8759957b
      Linus Torvalds authored
      Pull libnvdimm updates from Dan Williams:
      
       - Asynchronous address range scrub:
      
           Given the capacities of next generation persistent memory devices a
           scrub operation to find all poison may take 10s of seconds.  We
           want this scrub work to be done asynchronously with the rest of
           system initialization, so we move it out of line from the NFIT
           probing, i.e. acpi_nfit_add().
      
       - Clear poison:
      
           ACPI 6.1 introduces the ability to send "clear error" commands to
           the ACPI0012:00 device representing the root of an "nvdimm bus".
           Similar to relocating a bad block on a disk, this support clears
           media errors in response to a write.
      
       - Persistent memory resource tracking:
      
           A persistent memory range may be designated as simply "reserved" by
           platform firmware in the efi/e820 memory map.  Later when the NFIT
           driver loads it discovers that the range is "Persistent Memory".
      
           The NFIT bus driver inserts a resource to advertise that
           "persistent" attribute in the system resource tree for /proc/iomem
           and kernel-internal usages.
      
       - Miscellaneous cleanups and fixes:
      
           Workaround section misaligned pmem ranges when allocating a struct
           page memmap, fix handling of the read-only case in the ioctl path,
           and clean up block device major number allocation.
      
      * tag 'libnvdimm-for-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (26 commits)
        libnvdimm, pmem: clear poison on write
        libnvdimm, pmem: fix kmap_atomic() leak in error path
        nvdimm/btt: don't allocate unused major device number
        nvdimm/blk: don't allocate unused major device number
        pmem: don't allocate unused major device number
        ACPI: Change NFIT driver to insert new resource
        resource: Export insert_resource and remove_resource
        resource: Add remove_resource interface
        resource: Change __request_region to inherit from immediate parent
        libnvdimm, pmem: fix ia64 build, use PHYS_PFN
        nfit, libnvdimm: clear poison command support
        libnvdimm, pfn: 'resource'-address and 'size' attributes for pfn devices
        libnvdimm, pmem: adjust for section collisions with 'System RAM'
        libnvdimm, pmem: fix 'pfn' support for section-misaligned namespaces
        libnvdimm: Fix security issue with DSM IOCTL.
        libnvdimm: Clean-up access mode check.
        tools/testing/nvdimm: expand ars unit testing
        nfit: disable userspace initiated ars during scrub
        nfit: scrub and register regions in a workqueue
        nfit, libnvdimm: async region scrub workqueue
        ...
      8759957b
    • Merge tag 'dm-4.6-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm · 6968e6f8
      Linus Torvalds authored
      Pull device mapper updates from Mike Snitzer:
      
       - Most attention this cycle went to optimizing blk-mq request-based DM
         (dm-mq) that is used exclusively by DM multipath:
      
           - A stable fix for dm-mq that eliminates excessive context
             switching offers the biggest performance improvement (for both
             IOPs and throughput).
      
           - But more work is needed, during the next cycle, to reduce
             spinlock contention in DM multipath on large NUMA systems.
      
       - A stable fix for a NULL pointer seen when DM stats is enabled on a DM
         multipath device that must requeue an IO due to path failure.
      
       - A stable fix for DM snapshot to disallow the COW and origin devices
         from being identical.  This amounts to graceful failure in the face
         of userspace error because these devices shouldn't ever be identical.
      
       - Stable fixes for DM cache and DM thin provisioning to address crashes
         seen if/when their respective metadata device experiences failures
         that cause the transition to 'fail_io' mode.
      
       - The DM cache 'mq' policy is now an alias for the 'smq' policy.  The
         'smq' policy proved to be consistently better than 'mq'.  As such
         'mq', with all its complex user-facing tunables, has been eliminated.
      
       - Improve DM thin provisioning to consistently return -ENOSPC once the
         thin-pool's data volume is out of space.
      
       - Improve DM core to properly handle error propagation if
         bio_integrity_clone() fails in clone_bio().
      
       - Other small cleanups and improvements to DM core.
      
      * tag 'dm-4.6-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (41 commits)
        dm: fix rq_end_stats() NULL pointer in dm_requeue_original_request()
        dm thin: consistently return -ENOSPC if pool has run out of data space
        dm cache: bump the target version
        dm cache: make sure every metadata function checks fail_io
        dm: add missing newline between DM_DEBUG_BLOCK_STACK_TRACING and DM_BUFIO
        dm cache policy smq: clarify that mq registration failure was for 'mq'
        dm: return error if bio_integrity_clone() fails in clone_bio()
        dm thin metadata: don't issue prefetches if a transaction abort has failed
        dm snapshot: disallow the COW and origin devices from being identical
        dm cache: make the 'mq' policy an alias for 'smq'
        dm: drop unnecessary assignment of md->queue
        dm: reorder 'struct mapped_device' members to fix alignment and holes
        dm: remove dummy definition of 'struct dm_table'
        dm: add 'dm_numa_node' module parameter
        dm thin metadata: remove needless newline from subtree_dec() DMERR message
        dm mpath: cleanup reinstate_path() et al based on code review
        dm mpath: remove __pgpath_busy forward declaration, rename to pgpath_busy
        dm mpath: switch from 'unsigned' to 'bool' for flags where appropriate
        dm round robin: use percpu 'repeat_count' and 'current_path'
        dm path selector: remove 'repeat_count' return from .select_path hook
        ...
    • Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · cae8da04
      Linus Torvalds authored
      Pull SCSI updates from James Bottomley:
       "This pull includes driver updates from the usual suspects (stex, hpsa,
        ncr5380, scsi_dh, qla2xxx, be2iscsi, hisi_sas, cxlflash, aacraid,
        mpt3sas, megaraid_sas, ibmvscsi, ufs) plus an assortment of
        miscellaneous fixes.
      
        The major user visible change of this pull is that we've moved from
        monotonically increasing host number to an ida allocated one (meaning
        the numbers get re-used) because someone managed to wrap the count in
        an iscsi system.  We don't believe there will be any adverse
        consequences of this"
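
The difference between a monotonically increasing host number and an allocated one that gets re-used can be sketched as follows. This is only an illustration of the idea, not the kernel's ida API: the class names `MonotonicAllocator` and `ReusingAllocator` are hypothetical, and the "lowest free number first" policy is an assumption made for the example.

```python
import heapq


class MonotonicAllocator:
    """Hands out ever-increasing numbers; freed numbers are never reused."""

    def __init__(self):
        self._next = 0

    def alloc(self):
        n = self._next
        self._next += 1
        return n

    def free(self, n):
        # Nothing to do: the counter only moves forward, so it can
        # eventually wrap on a system that creates hosts often enough.
        pass


class ReusingAllocator:
    """Hands out the lowest free number, so released numbers come back."""

    def __init__(self):
        self._next = 0    # lowest number that has never been handed out
        self._freed = []  # min-heap of numbers that were released

    def alloc(self):
        if self._freed:
            return heapq.heappop(self._freed)
        n = self._next
        self._next += 1
        return n

    def free(self, n):
        heapq.heappush(self._freed, n)
```

With the reusing variant, a host that goes away and comes back may receive a number that an earlier host used, which is the user-visible change the message describes.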
      
      * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (230 commits)
        MAINTAINERS: use new email address for James Bottomley
        mpt3sas: Remove unnecessary synchronize_irq() before free_irq()
        sg: fix dxferp in from_to case
        cxlflash: Increase cmd_per_lun for better throughput
        cxlflash: Fix to avoid unnecessary scan with internal LUNs
        cxlflash: Reorder user context initialization
        cxlflash: Simplify attach path error cleanup
        cxlflash: Split out context initialization
        cxlflash: Unmap problem state area before detaching master context
        cxlflash: Simplify PCI registration
        scsi: storvsc: fix SRB_STATUS_ABORTED handling
        be2iscsi: set the boot_kset pointer to NULL in case of failure
        sd: Fix discard granularity when LBPRZ=1
        be2iscsi: Remove unnecessary synchronize_irq() before free_irq()
        scsi_sysfs: call 'device_add' after attaching device handler
        scsi_dh_emc: update 'access_state' field
        scsi_dh_rdac: update 'access_state' field
        scsi_dh_alua: update 'access_state' field
        scsi_dh_alua: use common definitions for ALUA state
        scsi: Add 'access_state' and 'preferred_path' attribute
        ...
    • Merge branch 'stable/for-linus-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/ibft · 7bb7a748
      Linus Torvalds authored
      Pull iscsi_ibft update from Konrad Rzeszutek Wilk:
       "A simple patch that had been rattling around in SuSE repo"
      
      * 'stable/for-linus-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/ibft:
        iscsi_ibft: Add prefix-len attr and display netmask
  2. 16 Mar, 2016 2 commits
    • Merge tag 'pci-v4.6-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci · 63e30271
      Linus Torvalds authored
      Pull PCI updates from Bjorn Helgaas:
       "PCI changes for v4.6:
      
        Enumeration:
         - Disable IO/MEM decoding for devices with non-compliant BARs (Bjorn Helgaas)
         - Mark Broadwell-EP Home Agent & PCU as having non-compliant BARs (Bjorn Helgaas)
      
        Resource management:
         - Mark shadow copy of VGA ROM as IORESOURCE_PCI_FIXED (Bjorn Helgaas)
         - Don't assign or reassign immutable resources (Bjorn Helgaas)
         - Don't enable/disable ROM BAR if we're using a RAM shadow copy (Bjorn Helgaas)
         - Set ROM shadow location in arch code, not in PCI core (Bjorn Helgaas)
         - Remove arch-specific IORESOURCE_ROM_SHADOW size from sysfs (Bjorn Helgaas)
         - ia64: Use ioremap() instead of open-coded equivalent (Bjorn Helgaas)
         - ia64: Keep CPU physical (not virtual) addresses in shadow ROM resource (Bjorn Helgaas)
         - MIPS: Keep CPU physical (not virtual) addresses in shadow ROM resource (Bjorn Helgaas)
         - Remove unused IORESOURCE_ROM_COPY and IORESOURCE_ROM_BIOS_COPY (Bjorn Helgaas)
         - Don't leak memory if sysfs_create_bin_file() fails (Bjorn Helgaas)
         - rcar: Remove PCI_PROBE_ONLY handling (Lorenzo Pieralisi)
         - designware: Remove PCI_PROBE_ONLY handling (Lorenzo Pieralisi)
      
        Virtualization:
         - Wait for up to 1000ms after FLR reset (Alex Williamson)
         - Support SR-IOV on any function type (Kelly Zytaruk)
         - Add ACS quirk for all Cavium devices (Manish Jaggi)
      
        AER:
         - Rename pci_ops_aer to aer_inj_pci_ops (Bjorn Helgaas)
         - Restore pci_ops pointer while calling original pci_ops (David Daney)
         - Fix aer_inject error codes (Jean Delvare)
         - Use dev_warn() in aer_inject (Jean Delvare)
         - Log actual error causes in aer_inject (Jean Delvare)
         - Log aer_inject error injections (Jean Delvare)
      
        VPD:
         - Prevent VPD access for buggy devices (Babu Moger)
         - Move pci_read_vpd() and pci_write_vpd() close to other VPD code (Bjorn Helgaas)
         - Move pci_vpd_release() from header file to pci/access.c (Bjorn Helgaas)
         - Remove struct pci_vpd_ops.release function pointer (Bjorn Helgaas)
         - Rename VPD symbols to remove unnecessary "pci22" (Bjorn Helgaas)
         - Fold struct pci_vpd_pci22 into struct pci_vpd (Bjorn Helgaas)
         - Sleep rather than busy-wait for VPD access completion (Bjorn Helgaas)
         - Update VPD definitions (Hannes Reinecke)
         - Allow access to VPD attributes with size 0 (Hannes Reinecke)
         - Determine actual VPD size on first access (Hannes Reinecke)
      
        Generic host bridge driver:
         - Move structure definitions to separate header file (David Daney)
         - Add pci_host_common_probe(), based on gen_pci_probe() (David Daney)
         - Expose pci_host_common_probe() for use by other drivers (David Daney)
      
        Altera host bridge driver:
         - Fix altera_pcie_link_is_up() (Ley Foon Tan)
      
        Cavium ThunderX host bridge driver:
         - Add PCIe host driver for ThunderX processors (David Daney)
         - Add driver for ThunderX-pass{1,2} on-chip devices (David Daney)
      
        Freescale i.MX6 host bridge driver:
         - Add DT bindings to configure PHY Tx driver settings (Justin Waters)
         - Move imx6_pcie_reset_phy() near other PHY handling functions (Lucas Stach)
         - Move PHY reset into imx6_pcie_establish_link() (Lucas Stach)
         - Remove broken Gen2 workaround (Lucas Stach)
         - Move link up check into imx6_pcie_wait_for_link() (Lucas Stach)
      
        Freescale Layerscape host bridge driver:
         - Add "fsl,ls2085a-pcie" compatible ID (Yang Shi)
      
        Intel VMD host bridge driver:
         - Attach VMD resources to parent domain's resource tree (Jon Derrick)
         - Set bus resource start to 0 (Keith Busch)
      
        Microsoft Hyper-V host bridge driver:
         - Add fwnode_handle to x86 pci_sysdata (Jake Oshins)
         - Look up IRQ domain by fwnode_handle (Jake Oshins)
         - Add paravirtual PCI front-end for Microsoft Hyper-V VMs (Jake Oshins)
      
        NVIDIA Tegra host bridge driver:
         - Add pci_ops.{add,remove}_bus() callbacks (Thierry Reding)
         - Implement ->{add,remove}_bus() callbacks (Thierry Reding)
         - Remove unused struct tegra_pcie.num_ports field (Thierry Reding)
         - Track bus -> CPU mapping (Thierry Reding)
         - Remove misleading PHYS_OFFSET (Thierry Reding)
      
        Renesas R-Car host bridge driver:
         - Depend on ARCH_RENESAS, not ARCH_SHMOBILE (Simon Horman)
      
        Synopsys DesignWare host bridge driver:
         - ARC: Add PCI support (Joao Pinto)
         - Add generic dw_pcie_wait_for_link() (Joao Pinto)
         - Add default link up check if sub-driver doesn't override (Joao Pinto)
         - Add driver for prototyping kits based on ARC SDP (Joao Pinto)
      
        TI Keystone host bridge driver:
         - Defer probing if devm_phy_get() returns -EPROBE_DEFER (Shawn Lin)
      
        Xilinx AXI host bridge driver:
         - Use of_pci_get_host_bridge_resources() to parse DT (Bharat Kumar Gogada)
         - Remove dependency on ARM-specific struct hw_pci (Bharat Kumar Gogada)
         - Don't call pci_fixup_irqs() on Microblaze (Bharat Kumar Gogada)
         - Update Zynq binding with Microblaze node (Bharat Kumar Gogada)
         - microblaze: Support generic Xilinx AXI PCIe Host Bridge IP driver (Bharat Kumar Gogada)
      
        Xilinx NWL host bridge driver:
         - Add support for Xilinx NWL PCIe Host Controller (Bharat Kumar Gogada)
      
        Miscellaneous:
         - Check device_attach() return value always (Bjorn Helgaas)
         - Move pci_set_flags() from asm-generic/pci-bridge.h to linux/pci.h (Bjorn Helgaas)
         - Remove includes of empty asm-generic/pci-bridge.h (Bjorn Helgaas)
         - ARM64: Remove generated include of asm-generic/pci-bridge.h (Bjorn Helgaas)
         - Remove empty asm-generic/pci-bridge.h (Bjorn Helgaas)
         - Remove includes of asm/pci-bridge.h (Bjorn Helgaas)
         - Consolidate PCI DMA constants and interfaces in linux/pci-dma-compat.h (Bjorn Helgaas)
         - unicore32: Remove unused HAVE_ARCH_PCI_SET_DMA_MASK definition (Bjorn Helgaas)
         - Cleanup pci/pcie/Kconfig whitespace (Andreas Ziegler)
         - Include pci/hotplug Kconfig directly from pci/Kconfig (Bjorn Helgaas)
         - Include pci/pcie/Kconfig directly from pci/Kconfig (Bogicevic Sasa)
         - frv: Remove stray pci_{alloc,free}_consistent() declaration (Christoph Hellwig)
         - Move pci_dma_* helpers to common code (Christoph Hellwig)
         - Add PCI_CLASS_SERIAL_USB_DEVICE definition (Heikki Krogerus)
         - Add QEMU top-level IDs for (sub)vendor & device (Robin H. Johnson)
         - Fix broken URL for Dell biosdevname (Naga Venkata Sai Indubhaskar Jupudi)"
      
      * tag 'pci-v4.6-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (94 commits)
        PCI: Add PCI_CLASS_SERIAL_USB_DEVICE definition
        PCI: designware: Add driver for prototyping kits based on ARC SDP
        PCI: designware: Add default link up check if sub-driver doesn't override
        PCI: designware: Add generic dw_pcie_wait_for_link()
        PCI: Cleanup pci/pcie/Kconfig whitespace
        PCI: Simplify pci_create_attr() control flow
        PCI: Don't leak memory if sysfs_create_bin_file() fails
        PCI: Simplify sysfs ROM cleanup
        PCI: Remove unused IORESOURCE_ROM_COPY and IORESOURCE_ROM_BIOS_COPY
        MIPS: Loongson 3: Keep CPU physical (not virtual) addresses in shadow ROM resource
        MIPS: Loongson 3: Use temporary struct resource * to avoid repetition
        ia64/PCI: Keep CPU physical (not virtual) addresses in shadow ROM resource
        ia64/PCI: Use ioremap() instead of open-coded equivalent
        ia64/PCI: Use temporary struct resource * to avoid repetition
        PCI: Clean up pci_map_rom() whitespace
        PCI: Remove arch-specific IORESOURCE_ROM_SHADOW size from sysfs
        PCI: thunder: Add driver for ThunderX-pass{1,2} on-chip devices
        PCI: thunder: Add PCIe host driver for ThunderX processors
        PCI: generic: Expose pci_host_common_probe() for use by other drivers
        PCI: generic: Add pci_host_common_probe(), based on gen_pci_probe()
        ...
    • Merge tag 'pm+acpi-4.6-rc1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 277edbab
      Linus Torvalds authored
      Pull power management and ACPI updates from Rafael Wysocki:
       "This time the majority of changes go into cpufreq and they are
        significant.
      
        First off, the way CPU frequency updates are triggered is different
        now.  Instead of having to set up and manage a deferrable timer for
        each CPU in the system to evaluate and possibly change its frequency
        periodically, cpufreq governors set up callbacks to be invoked by the
        scheduler on a regular basis (basically on utilization updates).  The
        "old" governors, "ondemand" and "conservative", still do all of their
        work in process context (although that is triggered by the scheduler
        now), but intel_pstate does it all in the callback invoked by the
        scheduler with no need for any additional asynchronous processing.
      
        Of course, this eliminates the overhead related to the management of
        all those timers, but also it allows the cpufreq governor code to be
        simplified quite a bit.  On top of that, the common code and data
        structures used by the "ondemand" and "conservative" governors are
        cleaned up and made more straightforward and some long-standing and
        quite annoying problems are addressed.  In particular, the handling of
        governor sysfs attributes is modified and the related locking becomes
        more fine grained which allows some concurrency problems to be avoided
        (particularly deadlocks with the core cpufreq code).
      
        In principle, the new mechanism for triggering frequency updates
        allows utilization information to be passed from the scheduler to
        cpufreq.  Although the current code doesn't make use of it, in the
        works is a new cpufreq governor that will make decisions based on the
        scheduler's utilization data.  That should allow the scheduler and
        cpufreq to work more closely together in the long run.
      
        In addition to the core and governor changes, cpufreq drivers are
        updated too.  Fixes and optimizations go into intel_pstate, the
        cpufreq-dt driver is updated on top of some modification in the
        Operating Performance Points (OPP) framework and there are fixes and
        other updates in the powernv cpufreq driver.
      
        Apart from the cpufreq updates there is some new ACPICA material,
        including a fix for a problem introduced by previous ACPICA updates,
        and some less significant changes in the ACPI code, like CPPC code
        optimizations, ACPI processor driver cleanups and support for loading
        ACPI tables from initrd.
      
        Also updated are the generic power domains framework, the Intel RAPL
        power capping driver and the turbostat utility and we have a bunch of
        traditional assorted fixes and cleanups.
      
        Specifics:
      
         - Redesign of cpufreq governors and the intel_pstate driver to make
           them use callbacks invoked by the scheduler to trigger CPU
           frequency evaluation instead of using per-CPU deferrable timers for
           that purpose (Rafael Wysocki).
      
         - Reorganization and cleanup of cpufreq governor code to make it more
           straightforward and fix some concurrency problems in it (Rafael
           Wysocki, Viresh Kumar).
      
         - Cleanup and improvements of locking in the cpufreq core (Viresh
           Kumar).
      
         - Assorted cleanups in the cpufreq core (Rafael Wysocki, Viresh
           Kumar, Eric Biggers).
      
         - intel_pstate driver updates including fixes, optimizations and a
           modification to make it enable hardware-coordinated P-state
           selection (HWP) by default if supported by the processor (Philippe
           Longepe, Srinivas Pandruvada, Rafael Wysocki, Viresh Kumar, Felipe
           Franciosi).
      
         - Operating Performance Points (OPP) framework updates to improve its
           handling of voltage regulators and device clocks and updates of the
           cpufreq-dt driver on top of that (Viresh Kumar, Jon Hunter).
      
         - Updates of the powernv cpufreq driver to fix initialization and
           cleanup problems in it and correct its worker thread handling with
           respect to CPU offline, new powernv_throttle tracepoint (Shilpasri
           Bhat).
      
         - ACPI cpufreq driver optimization and cleanup (Rafael Wysocki).
      
         - ACPICA updates including one fix for a regression introduced by
           previous changes in the ACPICA code (Bob Moore, Lv Zheng, David Box,
           Colin Ian King).
      
         - Support for installing ACPI tables from initrd (Lv Zheng).
      
         - Optimizations of the ACPI CPPC code (Prashanth Prakash, Ashwin
           Chaugule).
      
         - Support for _HID(ACPI0010) devices (ACPI processor containers) and
           ACPI processor driver cleanups (Sudeep Holla).
      
         - Support for ACPI-based enumeration of the AMBA bus (Graeme Gregory,
           Aleksey Makarov).
      
         - Modification of the ACPI PCI IRQ management code to make it treat
           255 in the Interrupt Line register as "not connected" on x86 (as
           per the specification) and avoid attempts to use that value as a
           valid interrupt vector (Chen Fan).
      
         - ACPI APEI fixes related to resource leaks (Josh Hunt).
      
         - Removal of modularity from a few ACPI drivers (BGRT, GHES,
           intel_pmic_crc) that cannot be built as modules in practice (Paul
           Gortmaker).
      
         - PNP framework update to make it treat ACPI_RESOURCE_TYPE_SERIAL_BUS
           as a valid resource type (Harb Abdulhamid).
      
         - New device ID (future AMD I2C controller) in the ACPI driver for
           AMD SoCs (APD) and in the designware I2C driver (Xiangliang Yu).
      
         - Assorted ACPI cleanups (Colin Ian King, Kaiyen Chang, Oleg Drokin).
      
         - cpuidle menu governor optimization to avoid a square root
           computation in it (Rasmus Villemoes).
      
         - Fix for potential use-after-free in the generic device properties
           framework (Heikki Krogerus).
      
         - Updates of the generic power domains (genpd) framework including
           support for multiple power states of a domain, fixes and debugfs
           output improvements (Axel Haslam, Jon Hunter, Laurent Pinchart,
           Geert Uytterhoeven).
      
         - Intel RAPL power capping driver updates to reduce IPI overhead in
           it (Jacob Pan).
      
         - System suspend/hibernation code cleanups (Eric Biggers, Saurabh
           Sengar).
      
         - Year 2038 fix for the process freezer (Abhilash Jindal).
      
         - turbostat utility updates including new features (decoding of more
           registers and CPUID fields, sub-second intervals support, GFX MHz
           and RC6 printout, --out command line option), fixes (syscall jitter
           detection and workaround, reduction of the number of syscalls
           made, fixes related to Xeon x200 processors, compiler warning
           fixes) and cleanups (Len Brown, Hubert Chrzaniuk, Chen Yu)"
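
The cpuidle menu governor item above mentions avoiding a square root computation. The sketch below is not the kernel code. It only illustrates one general way such a computation can be avoided, assuming the check compares a standard deviation against a non-negative limit: comparing the variance against the squared limit gives the same answer. The function name `stddev_at_most` is hypothetical.

```python
def stddev_at_most(samples, limit):
    """Return True if the population standard deviation of `samples`
    is <= `limit`, without ever taking a square root.

    Assumes `samples` is non-empty and `limit` is non-negative, so that
    sqrt(variance) <= limit is equivalent to variance <= limit * limit.
    """
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((x - mean) ** 2 for x in samples) / n
    return variance <= limit * limit
```

For example, `stddev_at_most([0, 10], 5)` holds because the variance is 25, which equals 5 squared.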
      
      * tag 'pm+acpi-4.6-rc1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (182 commits)
        tools/power turbostat: bugfix: TDP MSRs print bits fixing
        tools/power turbostat: correct output for MSR_NHM_SNB_PKG_CST_CFG_CTL dump
        tools/power turbostat: call __cpuid() instead of __get_cpuid()
        tools/power turbostat: indicate SMX and SGX support
        tools/power turbostat: detect and work around syscall jitter
        tools/power turbostat: show GFX%rc6
        tools/power turbostat: show GFXMHz
        tools/power turbostat: show IRQs per CPU
        tools/power turbostat: make fewer systems calls
        tools/power turbostat: fix compiler warnings
        tools/power turbostat: add --out option for saving output in a file
        tools/power turbostat: re-name "%Busy" field to "Busy%"
        tools/power turbostat: Intel Xeon x200: fix turbo-ratio decoding
        tools/power turbostat: Intel Xeon x200: fix erroneous bclk value
        tools/power turbostat: allow sub-sec intervals
        ACPI / APEI: ERST: Fixed leaked resources in erst_init
        ACPI / APEI: Fix leaked resources
        intel_pstate: Do not skip samples partially
        intel_pstate: Remove freq calculation from intel_pstate_calc_busy()
        intel_pstate: Move intel_pstate_calc_busy() into get_target_pstate_use_performance()
        ...