1. 29 Mar, 2020 1 commit
    • Roman Gushchin's avatar
      mm: fork: fix kernel_stack memcg stats for various stack implementations · 8380ce47
      Roman Gushchin authored
      Depending on CONFIG_VMAP_STACK and the THREAD_SIZE / PAGE_SIZE ratio the
      space for task stacks can be allocated using __vmalloc_node_range(),
      alloc_pages_node() and kmem_cache_alloc_node().
      
      In the first and the second cases page->mem_cgroup pointer is set, but
      in the third it's not: memcg membership of a slab page should be
      determined using the memcg_from_slab_page() function, which looks at
      page->slab_cache->memcg_params.memcg .  In this case, using
      mod_memcg_page_state() (as in account_kernel_stack()) is incorrect:
      page->mem_cgroup pointer is NULL even for pages charged to a non-root
      memory cgroup.
      
      It can lead to kernel_stack per-memcg counters permanently showing 0 on
      some architectures (depending on the configuration).
      
      In order to fix it, let's introduce a mod_memcg_obj_state() helper,
      which takes a pointer to a kernel object as a first argument, uses
      mem_cgroup_from_obj() to get a RCU-protected memcg pointer and calls
      mod_memcg_state().  It allows to handle all possible configurations
      (CONFIG_VMAP_STACK and various THREAD_SIZE/PAGE_SIZE values) without
      spilling any memcg/kmem specifics into fork.c .
      
      Note: This is a special version of the patch created for stable
      backports.  It contains code from the following two patches:
        - mm: memcg/slab: introduce mem_cgroup_from_obj()
        - mm: fork: fix kernel_stack memcg stats for various stack implementations
      
      [guro@fb.com: introduce mem_cgroup_from_obj()]
        Link: http://lkml.kernel.org/r/20200324004221.GA36662@carbon.dhcp.thefacebook.com
      Fixes: 4d96ba35
      
       ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Bharata B Rao <bharata@linux.ibm.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200303233550.251375-1-guro@fb.com
      
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8380ce47
  2. 22 Mar, 2020 3 commits
    • Chris Down's avatar
      mm, memcg: throttle allocators based on ancestral memory.high · e26733e0
      Chris Down authored
      Prior to this commit, we only directly check the affected cgroup's
      memory.high against its usage.  However, it's possible that we are being
      reclaimed as a result of hitting an ancestor memory.high and should be
      penalised based on that, instead.
      
      This patch changes memory.high overage throttling to use the largest
      overage in its ancestors when considering how many penalty jiffies to
      charge.  This makes sure that we penalise poorly behaving cgroups in the
      same way regardless of at what level of the hierarchy memory.high was
      breached.
      
      Fixes: 0e4b01df
      
       ("mm, memcg: throttle allocators when failing reclaim over memory.high")
      Reported-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarChris Down <chris@chrisdown.name>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nathan Chancellor <natechancellor@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: <stable@vger.kernel.org>	[5.4.x+]
      Link: http://lkml.kernel.org/r/8cd132f84bd7e16cdb8fde3378cdbf05ba00d387.1584036142.git.chris@chrisdown.name
      
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e26733e0
    • Chris Down's avatar
      mm, memcg: fix corruption on 64-bit divisor in memory.high throttling · d397a45f
      Chris Down authored
      Commit 0e4b01df had a bunch of fixups to use the right division
      method.  However, it seems that after all that it still wasn't right --
      div_u64 takes a 32-bit divisor.
      
      The headroom is still large (2^32 pages), so on mundane systems you
      won't hit this, but this should definitely be fixed.
      
      Fixes: 0e4b01df
      
       ("mm, memcg: throttle allocators when failing reclaim over memory.high")
      Reported-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarChris Down <chris@chrisdown.name>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nathan Chancellor <natechancellor@gmail.com>
      Cc: <stable@vger.kernel.org>	[5.4.x+]
      Link: http://lkml.kernel.org/r/80780887060514967d414b3cd91f9a316a16ab98.1584036142.git.chris@chrisdown.name
      
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d397a45f
    • Chunguang Xu's avatar
      memcg: fix NULL pointer dereference in __mem_cgroup_usage_unregister_event · 7d36665a
      Chunguang Xu authored
      An eventfd monitors multiple memory thresholds of the cgroup, closes them,
      the kernel deletes all events related to this eventfd.  Before all events
      are deleted, another eventfd monitors the memory threshold of this cgroup,
      leading to a crash:
      
        BUG: kernel NULL pointer dereference, address: 0000000000000004
        #PF: supervisor write access in kernel mode
        #PF: error_code(0x0002) - not-present page
        PGD 800000033058e067 P4D 800000033058e067 PUD 3355ce067 PMD 0
        Oops: 0002 [#1] SMP PTI
        CPU: 2 PID: 14012 Comm: kworker/2:6 Kdump: loaded Not tainted 5.6.0-rc4 #3
        Hardware name: LENOVO 20AWS01K00/20AWS01K00, BIOS GLET70WW (2.24 ) 05/21/2014
        Workqueue: events memcg_event_remove
        RIP: 0010:__mem_cgroup_usage_unregister_event+0xb3/0x190
        RSP: 0018:ffffb47e01c4fe18 EFLAGS: 00010202
        RAX: 0000000000000001 RBX: ffff8bb223a8a000 RCX: 0000000000000001
        RDX: 0000000000000001 RSI: ffff8bb22fb83540 RDI: 0000000000000001
        RBP: ffffb47e01c4fe48 R08: 0000000000000000 R09: 0000000000000010
        R10: 000000000000000c R11: 071c71c71c71c71c R12: ffff8bb226aba880
        R13: ffff8bb223a8a480 R14: 0000000000000000 R15: 0000000000000000
        FS:  0000000000000000(0000) GS:ffff8bb242680000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000004 CR3: 000000032c29c003 CR4: 00000000001606e0
        Call Trace:
          memcg_event_remove+0x32/0x90
          process_one_work+0x172/0x380
          worker_thread+0x49/0x3f0
          kthread+0xf8/0x130
          ret_from_fork+0x35/0x40
        CR2: 0000000000000004
      
      We can reproduce this problem in the following ways:
      
      1. We create a new cgroup subdirectory and a new eventfd, and then we
         monitor multiple memory thresholds of the cgroup through this eventfd.
      
      2.  closing this eventfd, and __mem_cgroup_usage_unregister_event ()
         will be called multiple times to delete all events related to this
         eventfd.
      
      The first time __mem_cgroup_usage_unregister_event() is called, the
      kernel will clear all items related to this eventfd in thresholds->
      primary.
      
      Since there is currently only one eventfd, thresholds-> primary becomes
      empty, so the kernel will set thresholds-> primary and hresholds-> spare
      to NULL.  If at this time, the user creates a new eventfd and monitor
      the memory threshold of this cgroup, kernel will re-initialize
      thresholds-> primary.
      
      Then when __mem_cgroup_usage_unregister_event () is called for the
      second time, because thresholds-> primary is not empty, the system will
      access thresholds-> spare, but thresholds-> spare is NULL, which will
      trigger a crash.
      
      In general, the longer it takes to delete all events related to this
      eventfd, the easier it is to trigger this problem.
      
      The solution is to check whether the thresholds associated with the
      eventfd has been cleared when deleting the event.  If so, we do nothing.
      
      [akpm@linux-foundation.org: fix comment, per Kirill]
      Fixes: 907860ed
      
       ("cgroups: make cftype.unregister_event() void-returning")
      Signed-off-by: default avatarChunguang Xu <brookxu@tencent.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/077a6f67-aefa-4591-efec-f2f3af2b0b02@gmail.com
      
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7d36665a
  3. 10 Mar, 2020 2 commits
    • Shakeel Butt's avatar
      net: memcg: late association of sock to memcg · d752a498
      Shakeel Butt authored
      
      
      If a TCP socket is allocated in IRQ context or cloned from unassociated
      (i.e. not associated to a memcg) in IRQ context then it will remain
      unassociated for its whole life. Almost half of the TCPs created on the
      system are created in IRQ context, so, memory used by such sockets will
      not be accounted by the memcg.
      
      This issue is more widespread in cgroup v1 where network memory
      accounting is opt-in but it can happen in cgroup v2 if the source socket
      for the cloning was created in root memcg.
      
      To fix the issue, just do the association of the sockets at the accept()
      time in the process context and then force charge the memory buffer
      already used and reserved by the socket.
      Signed-off-by: default avatarShakeel Butt <shakeelb@google.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d752a498
    • Shakeel Butt's avatar
      cgroup: memcg: net: do not associate sock with unrelated cgroup · e876ecc6
      Shakeel Butt authored
      We are testing network memory accounting in our setup and noticed
      inconsistent network memory usage and often unrelated cgroups network
      usage correlates with testing workload. On further inspection, it
      seems like mem_cgroup_sk_alloc() and cgroup_sk_alloc() are broken in
      irq context specially for cgroup v1.
      
      mem_cgroup_sk_alloc() and cgroup_sk_alloc() can be called in irq context
      and kind of assumes that this can only happen from sk_clone_lock()
      and the source sock object has already associated cgroup. However in
      cgroup v1, where network memory accounting is opt-in, the source sock
      can be unassociated with any cgroup and the new cloned sock can get
      associated with unrelated interrupted cgroup.
      
      Cgroup v2 can also suffer if the source sock object was created by
      process in the root cgroup or if sk_alloc() is called in irq context.
      The fix is to just do nothing in interrupt.
      
      WARNING: Please note that about half of the TCP sockets are allocated
      from the IRQ context, so, memory used by such sockets will not be
      accouted by the memcg.
      
      The stack trace of mem_cgroup_sk_alloc() from IRQ-context:
      
      CPU: 70 PID: 12720 Comm: ssh Tainted:  5.6.0-smp-DEV #1
      Hardware name: ...
      Call Trace:
       <IRQ>
       dump_stack+0x57/0x75
       mem_cgroup_sk_alloc+0xe9/0xf0
       sk_clone_lock+0x2a7/0x420
       inet_csk_clone_lock+0x1b/0x110
       tcp_create_openreq_child+0x23/0x3b0
       tcp_v6_syn_recv_sock+0x88/0x730
       tcp_check_req+0x429/0x560
       tcp_v6_rcv+0x72d/0xa40
       ip6_protocol_deliver_rcu+0xc9/0x400
       ip6_input+0x44/0xd0
       ? ip6_protocol_deliver_rcu+0x400/0x400
       ip6_rcv_finish+0x71/0x80
       ipv6_rcv+0x5b/0xe0
       ? ip6_sublist_rcv+0x2e0/0x2e0
       process_backlog+0x108/0x1e0
       net_rx_action+0x26b/0x460
       __do_softirq+0x104/0x2a6
       do_softirq_own_stack+0x2a/0x40
       </IRQ>
       do_softirq.part.19+0x40/0x50
       __local_bh_enable_ip+0x51/0x60
       ip6_finish_output2+0x23d/0x520
       ? ip6table_mangle_hook+0x55/0x160
       __ip6_finish_output+0xa1/0x100
       ip6_finish_output+0x30/0xd0
       ip6_output+0x73/0x120
       ? __ip6_finish_output+0x100/0x100
       ip6_xmit+0x2e3/0x600
       ? ipv6_anycast_cleanup+0x50/0x50
       ? inet6_csk_route_socket+0x136/0x1e0
       ? skb_free_head+0x1e/0x30
       inet6_csk_xmit+0x95/0xf0
       __tcp_transmit_skb+0x5b4/0xb20
       __tcp_send_ack.part.60+0xa3/0x110
       tcp_send_ack+0x1d/0x20
       tcp_rcv_state_process+0xe64/0xe80
       ? tcp_v6_connect+0x5d1/0x5f0
       tcp_v6_do_rcv+0x1b1/0x3f0
       ? tcp_v6_do_rcv+0x1b1/0x3f0
       __release_sock+0x7f/0xd0
       release_sock+0x30/0xa0
       __inet_stream_connect+0x1c3/0x3b0
       ? prepare_to_wait+0xb0/0xb0
       inet_stream_connect+0x3b/0x60
       __sys_connect+0x101/0x120
       ? __sys_getsockopt+0x11b/0x140
       __x64_sys_connect+0x1a/0x20
       do_syscall_64+0x51/0x200
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The stack trace of mem_cgroup_sk_alloc() from IRQ-context:
      Fixes: 2d758073 ("mm: memcontrol: consolidate cgroup socket tracking")
      Fixes: d979a39d
      
       ("cgroup: duplicate cgroup reference when cloning sockets")
      Signed-off-by: default avatarShakeel Butt <shakeelb@google.com>
      Reviewed-by: default avatarRoman Gushchin <guro@fb.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e876ecc6
  4. 21 Feb, 2020 1 commit
  5. 31 Jan, 2020 2 commits
  6. 14 Jan, 2020 1 commit
    • Roman Gushchin's avatar
      mm: memcg/slab: fix percpu slab vmstats flushing · 4a87e2a2
      Roman Gushchin authored
      Currently slab percpu vmstats are flushed twice: during the memcg
      offlining and just before freeing the memcg structure.  Each time percpu
      counters are summed, added to the atomic counterparts and propagated up
      by the cgroup tree.
      
      The second flushing is required due to how recursive vmstats are
      implemented: counters are batched in percpu variables on a local level,
      and once a percpu value is crossing some predefined threshold, it spills
      over to atomic values on the local and each ascendant levels.  It means
      that without flushing some numbers cached in percpu variables will be
      dropped on floor each time a cgroup is destroyed.  And with uptime the
      error on upper levels might become noticeable.
      
      The first flushing aims to make counters on ancestor levels more
      precise.  Dying cgroups may resume in the dying state for a long time.
      After kmem_cache reparenting which is performed during the offlining
      slab counters of the dying cgroup don't have any chances to be updated,
      because any slab operations will be performed on the parent level.  It
      means that the inaccuracy caused by percpu batching will not decrease up
      to the final destruction of the cgroup.  By the original idea flushing
      slab counters during the offlining should minimize the visible
      inaccuracy of slab counters on the parent level.
      
      The problem is that percpu counters are not zeroed after the first
      flushing.  So every cached percpu value is summed twice.  It creates a
      small error (up to 32 pages per cpu, but usually less) which accumulates
      on parent cgroup level.  After creating and destroying of thousands of
      child cgroups, slab counter on parent level can be way off the real
      value.
      
      For now, let's just stop flushing slab counters on memcg offlining.  It
      can't be done correctly without scheduling a work on each cpu: reading
      and zeroing it during css offlining can race with an asynchronous
      update, which doesn't expect values to be changed underneath.
      
      With this change, slab counters on parent level will become eventually
      consistent.  Once all dying children are gone, values are correct.  And
      if not, the error is capped by 32 * NR_CPUS pages per dying cgroup.
      
      It's not perfect, as slab are reparented, so any updates after the
      reparenting will happen on the parent level.  It means that if a slab
      page was allocated, a counter on child level was bumped, then the page
      was reparented and freed, the annihilation of positive and negative
      counter values will not happen until the child cgroup is released.  It
      makes slab counters different from others, and it might want us to
      implement flushing in a correct form again.  But it's also a question of
      performance: scheduling a work on each cpu isn't free, and it's an open
      question if the benefit of having more accurate counters is worth it.
      
      We might also consider flushing all counters on offlining, not only slab
      counters.
      
      So let's fix the main problem now: make the slab counters eventually
      consistent, so at least the error won't grow with uptime (or more
      precisely the number of created and destroyed cgroups).  And think about
      the accuracy of counters separately.
      
      Link: http://lkml.kernel.org/r/20191220042728.1045881-1-guro@fb.com
      Fixes: bee07b33
      
       ("mm: memcontrol: flush percpu slab vmstats on kmem offlining")
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4a87e2a2
  7. 05 Dec, 2019 1 commit
  8. 01 Dec, 2019 5 commits
  9. 16 Nov, 2019 1 commit
    • Roman Gushchin's avatar
      mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm() · 00d484f3
      Roman Gushchin authored
      We've encountered a rcu stall in get_mem_cgroup_from_mm():
      
        rcu: INFO: rcu_sched self-detected stall on CPU
        rcu: 33-....: (21000 ticks this GP) idle=6c6/1/0x4000000000000002 softirq=35441/35441 fqs=5017
        (t=21031 jiffies g=324821 q=95837) NMI backtrace for cpu 33
        <...>
        RIP: 0010:get_mem_cgroup_from_mm+0x2f/0x90
        <...>
         __memcg_kmem_charge+0x55/0x140
         __alloc_pages_nodemask+0x267/0x320
         pipe_write+0x1ad/0x400
         new_sync_write+0x127/0x1c0
         __kernel_write+0x4f/0xf0
         dump_emit+0x91/0xc0
         writenote+0xa0/0xc0
         elf_core_dump+0x11af/0x1430
         do_coredump+0xc65/0xee0
         get_signal+0x132/0x7c0
         do_signal+0x36/0x640
         exit_to_usermode_loop+0x61/0xd0
         do_syscall_64+0xd4/0x100
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The problem is caused by an exiting task which is associated with an
      offline memcg.  We're iterating over and over in the do {} while
      (!css_tryget_online()) loop, but obviously the memcg won't become online
      and the exiting task won't be migrated to a live memcg.
      
      Let's fix it by switching from css_tryget_online() to css_tryget().
      
      As css_tryget_online() cannot guarantee that the memcg won't go offline,
      the check is usually useless, except some rare cases when for example it
      determines if something should be presented to a user.
      
      A similar problem is described by commit 18fa84a2 ("cgroup: Use
      css_tryget() instead of css_tryget_online() in task_get_css()").
      
      Johannes:
      
      : The bug aside, it doesn't matter whether the cgroup is online for the
      : callers.  It used to matter when offlining needed to evacuate all charges
      : from the memcg, and so needed to prevent new ones from showing up, but we
      : don't care now.
      
      Link: http://lkml.kernel.org/r/20191106225131.3543616-1-guro@fb.com
      
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarShakeel Butt <shakeeb@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Koutn <mkoutny@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      00d484f3
  10. 06 Nov, 2019 3 commits
    • Johannes Weiner's avatar
      mm: memcontrol: fix network errors from failing __GFP_ATOMIC charges · 869712fd
      Johannes Weiner authored
      While upgrading from 4.16 to 5.2, we noticed these allocation errors in
      the log of the new kernel:
      
        SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC)
          cache: tw_sock_TCPv6(960:helper-logs), object size: 232, buffer size: 240, default order: 1, min order: 0
          node 0: slabs: 5, objs: 170, free: 0
      
              slab_out_of_memory+1
              ___slab_alloc+969
              __slab_alloc+14
              kmem_cache_alloc+346
              inet_twsk_alloc+60
              tcp_time_wait+46
              tcp_fin+206
              tcp_data_queue+2034
              tcp_rcv_state_process+784
              tcp_v6_do_rcv+405
              __release_sock+118
              tcp_close+385
              inet_release+46
              __sock_release+55
              sock_close+17
              __fput+170
              task_work_run+127
              exit_to_usermode_loop+191
              do_syscall_64+212
              entry_SYSCALL_64_after_hwframe+68
      
      accompanied by an increase in machines going completely radio silent
      under memory pressure.
      
      One thing that changed since 4.16 is e699e2c6 ("net, mm: account
      sock objects to kmemcg"), which made these slab caches subject to cgroup
      memory accounting and control.
      
      The problem with that is that cgroups, unlike the page allocator, do not
      maintain dedicated atomic reserves.  As a cgroup's usage hovers at its
      limit, atomic allocations - such as done during network rx - can fail
      consistently for extended periods of time.  The kernel is not able to
      operate under these conditions.
      
      We don't want to revert the culprit patch, because it indeed tracks a
      potentially substantial amount of memory used by a cgroup.
      
      We also don't want to implement dedicated atomic reserves for cgroups.
      There is no point in keeping a fixed margin of unused bytes in the
      cgroup's memory budget to accomodate a consumer that is impossible to
      predict - we'd be wasting memory and get into configuration headaches,
      not unlike what we have going with min_free_kbytes.  We do this for
      physical mem because we have to, but cgroups are an accounting game.
      
      Instead, account these privileged allocations to the cgroup, but let
      them bypass the configured limit if they have to.  This way, we get the
      benefits of accounting the consumed memory and have it exert pressure on
      the rest of the cgroup, but like with the page allocator, we shift the
      burden of reclaimining on behalf of atomic allocations onto the regular
      allocations that can block.
      
      Link: http://lkml.kernel.org/r/20191022233708.365764-1-hannes@cmpxchg.org
      Fixes: e699e2c6
      
       ("net, mm: account sock objects to kmemcg")
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>	[4.18+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      869712fd
    • Roman Gushchin's avatar
      mm: slab: make page_cgroup_ino() to recognize non-compound slab pages properly · 221ec5c0
      Roman Gushchin authored
      page_cgroup_ino() doesn't return a valid memcg pointer for non-compound
      slab pages, because it depends on PgHead AND PgSlab flags to be set to
      determine the memory cgroup from the kmem_cache.  It's correct for
      compound pages, but not for generic small pages.  Those don't have PgHead
      set, so it ends up returning zero.
      
      Fix this by replacing the condition to PageSlab() && !PageTail().
      
      Before this patch:
        [root@localhost ~]# ./page-types -c /sys/fs/cgroup/user.slice/user-0.slice/user@0.service/ | grep slab
        0x0000000000000080	        38        0  _______S___________________________________	slab
      
      After this patch:
        [root@localhost ~]# ./page-types -c /sys/fs/cgroup/user.slice/user-0.slice/user@0.service/ | grep slab
        0x0000000000000080	       147        0  _______S___________________________________	slab
      
      Also, hwpoison_filter_task() uses output of page_cgroup_ino() in order
      to filter error injection events based on memcg.  So if
      page_cgroup_ino() fails to return memcg pointer, we just fail to inject
      memory error.  Considering that hwpoison filter is for testing, affected
      users are limited and the impact should be marginal.
      
      [n-horiguchi@ah.jp.nec.com: changelog additions]
      Link: http://lkml.kernel.org/r/20191031012151.2722280-1-guro@fb.com
      Fixes: 4d96ba35
      
       ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      221ec5c0
    • Shakeel Butt's avatar
      mm: memcontrol: fix NULL-ptr deref in percpu stats flush · 7961eee3
      Shakeel Butt authored
      __mem_cgroup_free() can be called on the failure path in
      mem_cgroup_alloc().  However memcg_flush_percpu_vmstats() and
      memcg_flush_percpu_vmevents() which are called from __mem_cgroup_free()
      access the fields of memcg which can potentially be null if called from
      failure path from mem_cgroup_alloc().  Indeed syzbot has reported the
      following crash:
      
      	kasan: CONFIG_KASAN_INLINE enabled
      	kasan: GPF could be caused by NULL-ptr deref or user memory access
      	general protection fault: 0000 [#1] PREEMPT SMP KASAN
      	CPU: 0 PID: 30393 Comm: syz-executor.1 Not tainted 5.4.0-rc2+ #0
      	Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      	RIP: 0010:memcg_flush_percpu_vmstats+0x4ae/0x930 mm/memcontrol.c:3436
      	Code: 05 41 89 c0 41 0f b6 04 24 41 38 c7 7c 08 84 c0 0f 85 5d 03 00 00 44 3b 05 33 d5 12 08 0f 83 e2 00 00 00 4c 89 f0 48 c1 e8 03 <42> 80 3c 28 00 0f 85 91 03 00 00 48 8b 85 10 fe ff ff 48 8b b0 90
      	RSP: 0018:ffff888095c27980 EFLAGS: 00010206
      	RAX: 0000000000000012 RBX: ffff888095c27b28 RCX: ffffc90008192000
      	RDX: 0000000000040000 RSI: ffffffff8340fae7 RDI: 0000000000000007
      	RBP: ffff888095c27be0 R08: 0000000000000000 R09: ffffed1013f0da33
      	R10: ffffed1013f0da32 R11: ffff88809f86d197 R12: fffffbfff138b760
      	R13: dffffc0000000000 R14: 0000000000000090 R15: 0000000000000007
      	FS:  00007f5027170700(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
      	CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      	CR2: 0000000000710158 CR3: 00000000a7b18000 CR4: 00000000001406f0
      	DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      	DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      	Call Trace:
      	__mem_cgroup_free+0x1a/0x190 mm/memcontrol.c:5021
      	mem_cgroup_free mm/memcontrol.c:5033 [inline]
      	mem_cgroup_css_alloc+0x3a1/0x1ae0 mm/memcontrol.c:5160
      	css_create kernel/cgroup/cgroup.c:5156 [inline]
      	cgroup_apply_control_enable+0x44d/0xc40 kernel/cgroup/cgroup.c:3119
      	cgroup_mkdir+0x899/0x11b0 kernel/cgroup/cgroup.c:5401
      	kernfs_iop_mkdir+0x14d/0x1d0 fs/kernfs/dir.c:1124
      	vfs_mkdir+0x42e/0x670 fs/namei.c:3807
      	do_mkdirat+0x234/0x2a0 fs/namei.c:3830
      	__do_sys_mkdir fs/namei.c:3846 [inline]
      	__se_sys_mkdir fs/namei.c:3844 [inline]
      	__x64_sys_mkdir+0x5c/0x80 fs/namei.c:3844
      	do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
      	entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Fixing this by moving the flush to mem_cgroup_free as there is no need
      to flush anything if we see failure in mem_cgroup_alloc().
      
      Link: http://lkml.kernel.org/r/20191018165231.249872-1-shakeelb@google.com
      Fixes: bb65f89b ("mm: memcontrol: flush percpu vmevents before releasing memcg")
      Fixes: c350a99e
      
       ("mm: memcontrol: flush percpu vmstats before releasing memcg")
      Signed-off-by: default avatarShakeel Butt <shakeelb@google.com>
      Reported-by: syzbot+515d5bcfe179cdf049b2@syzkaller.appspotmail.com
      Reviewed-by: default avatarRoman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7961eee3
  11. 19 Oct, 2019 1 commit
  12. 09 Oct, 2019 1 commit
    • Qian Cai's avatar
      locking/lockdep: Remove unused @nested argument from lock_release() · 5facae4f
      Qian Cai authored
      Since the following commit:
      
        b4adfe8e
      
       ("locking/lockdep: Remove unused argument in __lock_release")
      
      @nested is no longer used in lock_release(), so remove it from all
      lock_release() calls and friends.
      Signed-off-by: default avatarQian Cai <cai@lca.pw>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: default avatarWill Deacon <will@kernel.org>
      Acked-by: default avatarDaniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: airlied@linux.ie
      Cc: akpm@linux-foundation.org
      Cc: alexander.levin@microsoft.com
      Cc: daniel@iogearbox.net
      Cc: davem@davemloft.net
      Cc: dri-devel@lists.freedesktop.org
      Cc: duyuyang@gmail.com
      Cc: gregkh@linuxfoundation.org
      Cc: hannes@cmpxchg.org
      Cc: intel-gfx@lists.freedesktop.org
      Cc: jack@suse.com
      Cc: jlbec@evilplan.or
      Cc: joonas.lahtinen@linux.intel.com
      Cc: joseph.qi@linux.alibaba.com
      Cc: jslaby@suse.com
      Cc: juri.lelli@redhat.com
      Cc: maarten.lankhorst@linux.intel.com
      Cc: mark@fasheh.com
      Cc: mhocko@kernel.org
      Cc: mripard@kernel.org
      Cc: ocfs2-devel@oss.oracle.com
      Cc: rodrigo.vivi@intel.com
      Cc: sean@poorly.run
      Cc: st@kernel.org
      Cc: tj@kernel.org
      Cc: tytso@mit.edu
      Cc: vdavydov.dev@gmail.com
      Cc: vincent.guittot@linaro.org
      Cc: viro@zeniv.linux.org.uk
      Link: https://lkml.kernel.org/r/1568909380-32199-1-git-send-email-cai@lca.pw
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      5facae4f
  13. 07 Oct, 2019 1 commit
    • Chris Down's avatar
      mm, memcg: proportional memory.{low,min} reclaim · 9783aa99
      Chris Down authored
      cgroup v2 introduces two memory protection thresholds: memory.low
      (best-effort) and memory.min (hard protection).  While they generally do
      what they say on the tin, there is a limitation in their implementation
      that makes them difficult to use effectively: that cliff behaviour often
      manifests when they become eligible for reclaim.  This patch implements
      more intuitive and usable behaviour, where we gradually mount more
      reclaim pressure as cgroups further and further exceed their protection
      thresholds.
      
      This cliff edge behaviour happens because we only choose whether or not
      to reclaim based on whether the memcg is within its protection limits
      (see the use of mem_cgroup_protected in shrink_node), but we don't vary
      our reclaim behaviour based on this information.  Imagine the following
      timeline, with the numbers the lruvec size in this zone:
      
      1. memory.low=1000000, memory.current=999999. 0 pages may be scanned.
      2. memory.low=1000000, memory.current=1000000. 0 pages may be scanned.
      3. memory.low=1000000, memory.current=1000001. 1000001* pages may be
         scanned. (?!)
      
      * Of course, we won't usually scan all available pages in the zone even
        without this patch because of scan control priority, over-reclaim
        protection, etc.  However, as shown by the tests at the end, these
        techniques don't sufficiently throttle such an extreme change in input,
        so cliff-like behaviour isn't really averted by their existence alone.
      
      Here's an example of how this plays out in practice.  At Facebook, we are
      trying to protect various workloads from "system" software, like
      configuration management tools, metric collectors, etc (see this[0] case
      study).  In order to find a suitable memory.low value, we start by
      determining the expected memory range within which the workload will be
      comfortable operating.  This isn't an exact science -- memory usage deemed
      "comfortable" will vary over time due to user behaviour, differences in
      composition of work, etc, etc.  As such we need to ballpark memory.low,
      but doing this is currently problematic:
      
      1. If we end up setting it too low for the workload, it won't have
         *any* effect (see discussion above).  The group will receive the full
         weight of reclaim and won't have any priority while competing with the
         less important system software, as if we had no memory.low configured
         at all.
      
      2. Because of this behaviour, we end up erring on the side of setting
         it too high, such that the comfort range is reliably covered.  However,
         protected memory is completely unavailable to the rest of the system,
         so we might cause undue memory and IO pressure there when we *know* we
         have some elasticity in the workload.
      
      3. Even if we get the value totally right, smack in the middle of the
         comfort zone, we get extreme jumps between no pressure and full
         pressure that cause unpredictable pressure spikes in the workload due
         to the current binary reclaim behaviour.
      
      With this patch, we can set it to our ballpark estimation without too much
      worry.  Any undesirable behaviour, such as too much or too little reclaim
      pressure on the workload or system will be proportional to how far our
      estimation is off.  This means we can set memory.low much more
      conservatively and thus waste less resources *without* the risk of the
      workload falling off a cliff if we overshoot.
      
      As a more abstract technical description, this unintuitive behaviour
      results in having to give high-priority workloads a large protection
      buffer on top of their expected usage to function reliably, as otherwise
      we have abrupt periods of dramatically increased memory pressure which
      hamper performance.  Having to set these thresholds so high wastes
      resources and generally works against the principle of work conservation.
      In addition, having proportional memory reclaim behaviour has other
      benefits.  Most notably, before this patch it's basically mandatory to set
      memory.low to a higher than desirable value because otherwise as soon as
      you exceed memory.low, all protection is lost, and all pages are eligible
      to scan again.  By contrast, having a gradual ramp in reclaim pressure
      means that you now still get some protection when thresholds are exceeded,
      which means that one can now be more comfortable setting memory.low to
      lower values without worrying that all protection will be lost.  This is
      important because workingset size is really hard to know exactly,
      especially with variable workloads, so at least getting *some* protection
      if your workingset size grows larger than you expect increases user
      confidence in setting memory.low without a huge buffer on top being
      needed.
      
      Thanks a lot to Johannes Weiner and Tejun Heo for their advice and
      assistance in thinking about how to make this work better.
      
      In testing these changes, I intended to verify that:
      
      1. Changes in page scanning become gradual and proportional instead of
         binary.
      
         To test this, I experimented stepping further and further down
         memory.low protection on a workload that floats around 19G workingset
         when under memory.low protection, watching page scan rates for the
         workload cgroup:
      
         +------------+-----------------+--------------------+--------------+
         | memory.low | test (pgscan/s) | control (pgscan/s) | % of control |
         +------------+-----------------+--------------------+--------------+
         |        21G |               0 |                  0 | N/A          |
         |        17G |             867 |               3799 | 23%          |
         |        12G |            1203 |               3543 | 34%          |
         |         8G |            2534 |               3979 | 64%          |
         |         4G |            3980 |               4147 | 96%          |
         |          0 |            3799 |               3980 | 95%          |
         +------------+-----------------+--------------------+--------------+
      
         As you can see, the test kernel (with a kernel containing this
         patch) ramps up page scanning significantly more gradually than the
         control kernel (without this patch).
      
      2. More gradual ramp up in reclaim aggression doesn't result in
         premature OOMs.
      
         To test this, I wrote a script that slowly increments the number of
         pages held by stress(1)'s --vm-keep mode until a production system
         entered severe overall memory contention.  This script runs in a highly
         protected slice taking up the majority of available system memory.
         Watching vmstat revealed that page scanning continued essentially
         nominally between test and control, without causing forward reclaim
         progress to become arrested.
      
      [0]: https://facebookmicrosites.github.io/cgroup2/docs/overview.html#case-study-the-fbtax2-project
      
      [akpm@linux-foundation.org: reflow block comments to fit in 80 cols]
      [chris@chrisdown.name: handle cgroup_disable=memory when getting memcg protection]
        Link: http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name
      Link: http://lkml.kernel.org/r/20190124014455.GA6396@chrisdown.name
      
      Signed-off-by: default avatarChris Down <chris@chrisdown.name>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarRoman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9783aa99
  14. 26 Sep, 2019 1 commit
    • Michal Hocko's avatar
      memcg, kmem: do not fail __GFP_NOFAIL charges · e55d9d9b
      Michal Hocko authored
      Thomas has noticed the following NULL ptr dereference when using cgroup
      v1 kmem limit:
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
      PGD 0
      P4D 0
      Oops: 0000 [#1] PREEMPT SMP PTI
      CPU: 3 PID: 16923 Comm: gtk-update-icon Not tainted 4.19.51 #42
      Hardware name: Gigabyte Technology Co., Ltd. Z97X-Gaming G1/Z97X-Gaming G1, BIOS F9 07/31/2015
      RIP: 0010:create_empty_buffers+0x24/0x100
      Code: cd 0f 1f 44 00 00 0f 1f 44 00 00 41 54 49 89 d4 ba 01 00 00 00 55 53 48 89 fb e8 97 fe ff ff 48 89 c5 48 89 c2 eb 03 48 89 ca <48> 8b 4a 08 4c 09 22 48 85 c9 75 f1 48 89 6a 08 48 8b 43 18 48 8d
      RSP: 0018:ffff927ac1b37bf8 EFLAGS: 00010286
      RAX: 0000000000000000 RBX: fffff2d4429fd740 RCX: 0000000100097149
      RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff9075a99fbe00
      RBP: 0000000000000000 R08: fffff2d440949cc8 R09: 00000000000960c0
      R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000000
      R13: ffff907601f18360 R14: 0000000000002000 R15: 0000000000001000
      FS:  00007fb55b288bc0(0000) GS:ffff90761f8c0000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000008 CR3: 000000007aebc002 CR4: 00000000001606e0
      Call Trace:
       create_page_buffers+0x4d/0x60
       __block_write_begin_int+0x8e/0x5a0
       ? ext4_inode_attach_jinode.part.82+0xb0/0xb0
       ? jbd2__journal_start+0xd7/0x1f0
       ext4_da_write_begin+0x112/0x3d0
       generic_perform_write+0xf1/0x1b0
       ? file_update_time+0x70/0x140
       __generic_file_write_iter+0x141/0x1a0
       ext4_file_write_iter+0xef/0x3b0
       __vfs_write+0x17e/0x1e0
       vfs_write+0xa5/0x1a0
       ksys_write+0x57/0xd0
       do_syscall_64+0x55/0x160
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Tetsuo then noticed that this is because the __memcg_kmem_charge_memcg
      fails __GFP_NOFAIL charge when the kmem limit is reached.  This is a wrong
      behavior because nofail allocations are not allowed to fail.  Normal
      charge path simply forces the charge even if that means to cross the
      limit.  Kmem accounting should be doing the same.
      
      Link: http://lkml.kernel.org/r/20190906125608.32129-1-mhocko@kernel.org
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reported-by: default avatarThomas Lindroth <thomas.lindroth@gmail.com>
      Debugged-by: default avatarTetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Thomas Lindroth <thomas.lindroth@gmail.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e55d9d9b
  15. 24 Sep, 2019 7 commits
    • Yang Shi's avatar
      mm: thp: make deferred split shrinker memcg aware · 87eaceb3
      Yang Shi authored
      Currently THP deferred split shrinker is not memcg aware, this may cause
      premature OOM with some configuration.  For example the below test would
      run into premature OOM easily:
      
      $ cgcreate -g memory:thp
      $ echo 4G > /sys/fs/cgroup/memory/thp/memory/limit_in_bytes
      $ cgexec -g memory:thp transhuge-stress 4000
      
      transhuge-stress comes from kernel selftest.
      
      It is easy to hit OOM, but there are still a lot THP on the deferred split
      queue, memcg direct reclaim can't touch them since the deferred split
      shrinker is not memcg aware.
      
      Convert deferred split shrinker memcg aware by introducing per memcg
      deferred split queue.  The THP should be on either per node or per memcg
      deferred split queue if it belongs to a memcg.  When the page is
      immigrated to the other memcg, it will be immigrated to the target memcg's
      deferred split queue too.
      
      Reuse the second tail page's deferred_list for per memcg list since the
      same THP can't be on multiple deferred split queues.
      
      [yang.shi@linux.alibaba.com: simplify deferred split queue dereference per Kirill Tkhai]
        Link: http://lkml.kernel.org/r/1566496227-84952-5-git-send-email-yang.shi@linux.alibaba.com
      Link: http://lkml.kernel.org/r/1565144277-36240-5-git-send-email-yang.shi@linux.alibaba.com
      
      Signed-off-by: default avatarYang Shi <yang.shi@linux.alibaba.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: default avatarKirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      87eaceb3
    • Yang Shi's avatar
      mm: shrinker: make shrinker not depend on memcg kmem · 0a432dcb
      Yang Shi authored
      Currently shrinker is just allocated and can work when memcg kmem is
      enabled.  But, THP deferred split shrinker is not slab shrinker, it
      doesn't make too much sense to have such shrinker depend on memcg kmem.
      It should be able to reclaim THP even though memcg kmem is disabled.
      
      Introduce a new shrinker flag, SHRINKER_NONSLAB, for non-slab shrinker.
      When memcg kmem is disabled, just such shrinkers can be called in
      shrinking memcg slab.
      
      [yang.shi@linux.alibaba.com: add comment]
        Link: http://lkml.kernel.org/r/1566496227-84952-4-git-send-email-yang.shi@linux.alibaba.com
      Link: http://lkml.kernel.org/r/1565144277-36240-4-git-send-email-yang.shi@linux.alibaba.com
      
      Signed-off-by: default avatarYang Shi <yang.shi@linux.alibaba.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: default avatarKirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0a432dcb
    • Michal Hocko's avatar
      memcg, kmem: deprecate kmem.limit_in_bytes · 0158115f
      Michal Hocko authored
      Cgroup v1 memcg controller has exposed a dedicated kmem limit to users
      which turned out to be really a bad idea because there are paths which
      cannot shrink the kernel memory usage enough to get below the limit (e.g.
      because the accounted memory is not reclaimable).  There are cases when
      the failure is even not allowed (e.g.  __GFP_NOFAIL).  This means that the
      kmem limit is in excess to the hard limit without any way to shrink and
      thus completely useless.  OOM killer cannot be invoked to handle the
      situation because that would lead to a premature oom killing.
      
      As a result many places might see ENOMEM returning from kmalloc and result
      in unexpected errors.  E.g.  a global OOM killer when there is a lot of
      free memory because ENOMEM is translated into VM_FAULT_OOM in #PF path and
      therefore pagefault_out_of_memory would result in OOM killer.
      
      Please note that the kernel memory is still accounted to the overall limit
      along with the user memory so removing the kmem specific limit should
      still allow to contain kernel memory consumption.  Unlike the kmem one,
      though, it invokes memory reclaim and targeted memcg oom killing if
      necessary.
      
      Start the deprecation process by crying to the kernel log.  Let's see
      whether there are relevant usecases and simply return to EINVAL in the
      second stage if nobody complains in few releases.
      
      [akpm@linux-foundation.org: tweak documentation text]
      Link: http://lkml.kernel.org/r/20190911151612.GI4023@dhcp22.suse.cz
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Thomas Lindroth <thomas.lindroth@gmail.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0158115f
    • Qian Cai's avatar
      mm/memcontrol.c: fix a -Wunused-function warning · 4d0e3230
      Qian Cai authored
      mem_cgroup_id_get() was introduced in commit 73f576c0 ("mm:memcontrol:
      fix cgroup creation failure after many small jobs").
      
      Later, it no longer has any user since the commits,
      
      1f47b61f ("mm: memcontrol: fix swap counter leak on swapout from offline cgroup")
      58fa2a55 ("mm: memcontrol: add sanity checks for memcg->id.ref on get/put")
      
      so safe to remove it.
      
      Link: http://lkml.kernel.org/r/1568648453-5482-1-git-send-email-cai@lca.pw
      
      Signed-off-by: default avatarQian Cai <cai@lca.pw>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4d0e3230
    • Roman Gushchin's avatar
      mm: memcontrol: switch to rcu protection in drain_all_stock() · e1a366be
      Roman Gushchin authored
      Commit 72f0184c ("mm, memcg: remove hotplug locking from try_charge")
      introduced css_tryget()/css_put() calls in drain_all_stock(), which are
      supposed to protect the target memory cgroup from being released during
      the mem_cgroup_is_descendant() call.
      
      However, it's not completely safe.  In theory, memcg can go away between
      reading stock->cached pointer and calling css_tryget().
      
      This can happen if drain_all_stock() races with drain_local_stock()
      performed on the remote cpu as a result of a work, scheduled by the
      previous invocation of drain_all_stock().
      
      The race is a bit theoretical and there are few chances to trigger it, but
      the current code looks a bit confusing, so it makes sense to fix it
      anyway.  The code looks like as if css_tryget() and css_put() are used to
      protect stocks drainage.  It's not necessary because stocked pages are
      holding references to the cached cgroup.  And it obviously won't work for
      works, scheduled on other cpus.
      
      So, let's read the stock->cached pointer and evaluate the memory cgroup
      inside a rcu read section, and get rid of css_tryget()/css_put() calls.
      
      Link: http://lkml.kernel.org/r/20190802192241.3253165-1-guro@fb.com
      
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e1a366be
    • Chris Down's avatar
      mm, memcg: throttle allocators when failing reclaim over memory.high · 0e4b01df
      Chris Down authored
      We're trying to use memory.high to limit workloads, but have found that
      containment can frequently fail completely and cause OOM situations
      outside of the cgroup.  This happens especially with swap space -- either
      when none is configured, or swap is full.  These failures often also don't
      have enough warning to allow one to react, whether for a human or for a
      daemon monitoring PSI.
      
      Here is output from a simple program showing how long it takes in usec
      (column 2) to allocate a megabyte of anonymous memory (column 1) when a
      cgroup is already beyond its memory high setting, and no swap is
      available:
      
          [root@ktst ~]# systemd-run -p MemoryHigh=100M -p MemorySwapMax=1 \
          > --wait -t timeout 300 /root/mdf
          [...]
          95  1035
          96  1038
          97  1000
          98  1036
          99  1048
          100 1590
          101 1968
          102 1776
          103 1863
          104 1757
          105 1921
          106 1893
          107 1760
          108 1748
          109 1843
          110 1716
          111 1924
          112 1776
          113 1831
          114 1766
          115 1836
          116 1588
          117 1912
          118 1802
          119 1857
          120 1731
          [...]
          [System OOM in 2-3 seconds]
      
      The delay does go up extremely marginally past the 100MB memory.high
      threshold, as now we spend time scanning before returning to usermode, but
      it's nowhere near enough to contain growth.  It also doesn't get worse the
      more pages you have, since it only considers nr_pages.
      
      The current situation goes against both the expectations of users of
      memory.high, and our intentions as cgroup v2 developers.  In
      cgroup-v2.txt, we claim that we will throttle and only under "extreme
      conditions" will memory.high protection be breached.  Likewise, cgroup v2
      users generally also expect that memory.high should throttle workloads as
      they exceed their high threshold.  However, as seen above, this isn't
      always how it works in practice -- even on banal setups like those with no
      swap, or where swap has become exhausted, we can end up with memory.high
      being breached and us having no weapons left in our arsenal to combat
      runaway growth with, since reclaim is futile.
      
      It's also hard for system monitoring software or users to tell how bad the
      situation is, as "high" events for the memcg may in some cases be benign,
      and in others be catastrophic.  The current status quo is that we fail
      containment in a way that doesn't provide any advance warning that things
      are about to go horribly wrong (for example, we are about to invoke the
      kernel OOM killer).
      
      This patch introduces explicit throttling when reclaim is failing to keep
      memcg size contained at the memory.high setting.  It does so by applying
      an exponential delay curve derived from the memcg's overage compared to
      memory.high.  In the normal case where the memcg is either below or only
      marginally over its memory.high setting, no throttling will be performed.
      
      This composes well with system health monitoring and remediation, as these
      allocator delays are factored into PSI's memory pressure calculations.
      This both creates a mechanism system administrators or applications
      consuming the PSI interface to trivially see that the memcg in question is
      struggling and use that to make more reasonable decisions, and permits
      them enough time to act.  Either of these can act with significantly more
      nuance than that we can provide using the system OOM killer.
      
      This is a similar idea to memory.oom_control in cgroup v1 which would put
      the cgroup to sleep if the threshold was violated, but it's also
      significantly improved as it results in visible memory pressure, and also
      doesn't schedule indefinitely, which previously made tracing and other
      introspection difficult (ie.  it's clamped at 2*HZ per allocation through
      MEMCG_MAX_HIGH_DELAY_JIFFIES).
      
      Contrast the previous results with a kernel with this patch:
      
          [root@ktst ~]# systemd-run -p MemoryHigh=100M -p MemorySwapMax=1 \
          > --wait -t timeout 300 /root/mdf
          [...]
          95  1002
          96  1000
          97  1002
          98  1003
          99  1000
          100 1043
          101 84724
          102 330628
          103 610511
          104 1016265
          105 1503969
          106 2391692
          107 2872061
          108 3248003
          109 4791904
          110 5759832
          111 6912509
          112 8127818
          113 9472203
          114 12287622
          115 12480079
          116 14144008
          117 15808029
          118 16384500
          119 16383242
          120 16384979
          [...]
      
      As you can see, in the normal case, memory allocation takes around 1000
      usec.  However, as we exceed our memory.high, things start to increase
      exponentially, but fairly leniently at first.  Our first megabyte over
      memory.high takes us 0.16 seconds, then the next is 0.46 seconds, then the
      next is almost an entire second.  This gets worse until we reach our
      eventual 2*HZ clamp per batch, resulting in 16 seconds per megabyte.
      However, this is still making forward progress, so permits tracing or
      further analysis with programs like GDB.
      
      We use an exponential curve for our delay penalty for a few reasons:
      
      1. We run mem_cgroup_handle_over_high to potentially do reclaim after
         we've already performed allocations, which means that temporarily
         going over memory.high by a small amount may be perfectly legitimate,
         even for compliant workloads. We don't want to unduly penalise such
         cases.
      2. An exponential curve (as opposed to a static or linear delay) allows
         ramping up memory pressure stats more gradually, which can be useful
         to work out that you have set memory.high too low, without destroying
         application performance entirely.
      
      This patch expands on earlier work by Johannes Weiner. Thanks!
      
      [akpm@linux-foundation.org: fix max() warning]
      [akpm@linux-foundation.org: fix __udivdi3 ref on 32-bit]
      [akpm@linux-foundation.org: fix it even more]
      [chris@chrisdown.name: fix 64-bit divide even more]
      Link: http://lkml.kernel.org/r/20190723180700.GA29459@chrisdown.name
      
      Signed-off-by: default avatarChris Down <chris@chrisdown.name>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nathan Chancellor <natechancellor@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0e4b01df
    • Matthew Wilcox (Oracle)'s avatar
      mm: introduce compound_nr() · d8c6546b
      Matthew Wilcox (Oracle) authored
      Replace 1 << compound_order(page) with compound_nr(page).  Minor
      improvements in readability.
      
      Link: http://lkml.kernel.org/r/20190721104612.19120-4-willy@infradead.org
      
      Signed-off-by: default avatarMatthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarIra Weiny <ira.weiny@intel.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d8c6546b
  16. 07 Sep, 2019 2 commits
  17. 31 Aug, 2019 3 commits
  18. 30 Aug, 2019 1 commit
  19. 27 Aug, 2019 1 commit
    • Tejun Heo's avatar
      writeback, memcg: Implement foreign dirty flushing · 97b27821
      Tejun Heo authored
      
      
      There's an inherent mismatch between memcg and writeback.  The former
      trackes ownership per-page while the latter per-inode.  This was a
      deliberate design decision because honoring per-page ownership in the
      writeback path is complicated, may lead to higher CPU and IO overheads
      and deemed unnecessary given that write-sharing an inode across
      different cgroups isn't a common use-case.
      
      Combined with inode majority-writer ownership switching, this works
      well enough in most cases but there are some pathological cases.  For
      example, let's say there are two cgroups A and B which keep writing to
      different but confined parts of the same inode.  B owns the inode and
      A's memory is limited far below B's.  A's dirty ratio can rise enough
      to trigger balance_dirty_pages() sleeps but B's can be low enough to
      avoid triggering background writeback.  A will be slowed down without
      a way to make writeback of the dirty pages happen.
      
      This patch implements foreign dirty recording and foreign mechanism so
      that when a memcg encounters a condition as above it can trigger
      flushes on bdi_writebacks which can clean its pages.  Please see the
      comment on top of mem_cgroup_track_foreign_dirty_slowpath() for
      details.
      
      A reproducer follows.
      
      write-range.c::
      
        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>
        #include <fcntl.h>
        #include <sys/types.h>
      
        static const char *usage = "write-range FILE START SIZE\n";
      
        int main(int argc, char **argv)
        {
      	  int fd;
      	  unsigned long start, size, end, pos;
      	  char *endp;
      	  char buf[4096];
      
      	  if (argc < 4) {
      		  fprintf(stderr, usage);
      		  return 1;
      	  }
      
      	  fd = open(argv[1], O_WRONLY);
      	  if (fd < 0) {
      		  perror("open");
      		  return 1;
      	  }
      
      	  start = strtoul(argv[2], &endp, 0);
      	  if (*endp != '\0') {
      		  fprintf(stderr, usage);
      		  return 1;
      	  }
      
      	  size = strtoul(argv[3], &endp, 0);
      	  if (*endp != '\0') {
      		  fprintf(stderr, usage);
      		  return 1;
      	  }
      
      	  end = start + size;
      
      	  while (1) {
      		  for (pos = start; pos < end; ) {
      			  long bread, bwritten = 0;
      
      			  if (lseek(fd, pos, SEEK_SET) < 0) {
      				  perror("lseek");
      				  return 1;
      			  }
      
      			  bread = read(0, buf, sizeof(buf) < end - pos ?
      					       sizeof(buf) : end - pos);
      			  if (bread < 0) {
      				  perror("read");
      				  return 1;
      			  }
      			  if (bread == 0)
      				  return 0;
      
      			  while (bwritten < bread) {
      				  long this;
      
      				  this = write(fd, buf + bwritten,
      					       bread - bwritten);
      				  if (this < 0) {
      					  perror("write");
      					  return 1;
      				  }
      
      				  bwritten += this;
      				  pos += bwritten;
      			  }
      		  }
      	  }
        }
      
      repro.sh::
      
        #!/bin/bash
      
        set -e
        set -x
      
        sysctl -w vm.dirty_expire_centisecs=300000
        sysctl -w vm.dirty_writeback_centisecs=300000
        sysctl -w vm.dirtytime_expire_seconds=300000
        echo 3 > /proc/sys/vm/drop_caches
      
        TEST=/sys/fs/cgroup/test
        A=$TEST/A
        B=$TEST/B
      
        mkdir -p $A $B
        echo "+memory +io" > $TEST/cgroup.subtree_control
        echo $((1<<30)) > $A/memory.high
        echo $((32<<30)) > $B/memory.high
      
        rm -f testfile
        touch testfile
        fallocate -l 4G testfile
      
        echo "Starting B"
      
        (echo $BASHPID > $B/cgroup.procs
         pv -q --rate-limit 70M < /dev/urandom | ./write-range testfile $((2<<30)) $((2<<30))) &
      
        echo "Waiting 10s to ensure B claims the testfile inode"
        sleep 5
        sync
        sleep 5
        sync
        echo "Starting A"
      
        (echo $BASHPID > $A/cgroup.procs
         pv < /dev/urandom | ./write-range testfile 0 $((2<<30)))
      
      v2: Added comments explaining why the specific intervals are being used.
      
      v3: Use 0 @nr when calling cgroup_writeback_by_id() to use best-effort
          flushing while avoding possible livelocks.
      
      v4: Use get_jiffies_64() and time_before/after64() instead of raw
          jiffies_64 and arthimetic comparisons as suggested by Jan.
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      97b27821
  20. 25 Aug, 2019 2 commits
    • Roman Gushchin's avatar
      mm: memcontrol: flush percpu vmevents before releasing memcg · bb65f89b
      Roman Gushchin authored
      Similar to vmstats, percpu caching of local vmevents leads to an
      accumulation of errors on non-leaf levels.  This happens because some
      leftovers may remain in percpu caches, so that they are never propagated
      up by the cgroup tree and just disappear into nonexistence with on
      releasing of the memory cgroup.
      
      To fix this issue let's accumulate and propagate percpu vmevents values
      before releasing the memory cgroup similar to what we're doing with
      vmstats.
      
      Since on cpu hotplug we do flush percpu vmstats anyway, we can iterate
      only over online cpus.
      
      Link: http://lkml.kernel.org/r/20190819202338.363363-4-guro@fb.com
      Fixes: 42a30035
      
       ("mm: memcontrol: fix recursive statistics correctness & scalabilty")
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bb65f89b
    • Roman Gushchin's avatar
      mm: memcontrol: flush percpu vmstats before releasing memcg · c350a99e
      Roman Gushchin authored
      Percpu caching of local vmstats with the conditional propagation by the
      cgroup tree leads to an accumulation of errors on non-leaf levels.
      
      Let's imagine two nested memory cgroups A and A/B.  Say, a process
      belonging to A/B allocates 100 pagecache pages on the CPU 0.  The percpu
      cache will spill 3 times, so that 32*3=96 pages will be accounted to A/B
      and A atomic vmstat counters, 4 pages will remain in the percpu cache.
      
      Imagine A/B is nearby memory.max, so that every following allocation
      triggers a direct reclaim on the local CPU.  Say, each such attempt will
      free 16 pages on a new cpu.  That means every percpu cache will have -16
      pages, except the first one, which will have 4 - 16 = -12.  A/B and A
      atomic counters will not be touched at all.
      
      Now a user removes A/B.  All percpu caches are freed and corresponding
      vmstat numbers are forgotten.  A has 96 pages more than expected.
      
      As memory cgroups are created and destroyed, errors do accumulate.  Even
      1-2 pages differences can accumulate into large numbers.
      
      To fix this issue let's accumulate and propagate percpu vmstat values
      before releasing the memory cgroup.  At this point these numbers are
      stable and cannot be changed.
      
      Since on cpu hotplug we do flush percpu vmstats anyway, we can iterate
      only over online cpus.
      
      Link: http://lkml.kernel.org/r/20190819202338.363363-2-guro@fb.com
      Fixes: 42a30035
      
       ("mm: memcontrol: fix recursive statistics correctness & scalabilty")
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c350a99e