1. 31 Jul, 2014 1 commit
    • Michal Hocko's avatar
      memcg: oom_notify use-after-free fix · 2bcf2e92
      Michal Hocko authored
      Paul Furtado has reported the following GPF:
      
        general protection fault: 0000 [#1] SMP
        Modules linked in: ipv6 dm_mod xen_netfront coretemp hwmon x86_pkg_temp_thermal crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 microcode pcspkr ext4 jbd2 mbcache raid0 xen_blkfront
        CPU: 3 PID: 3062 Comm: java Not tainted 3.16.0-rc5 #1
        task: ffff8801cfe8f170 ti: ffff8801d2ec4000 task.ti: ffff8801d2ec4000
        RIP: e030:mem_cgroup_oom_synchronize+0x140/0x240
        RSP: e02b:ffff8801d2ec7d48  EFLAGS: 00010283
        RAX: 0000000000000001 RBX: ffff88009d633800 RCX: 000000000000000e
        RDX: fffffffffffffffe RSI: ffff88009d630200 RDI: ffff88009d630200
        RBP: ffff8801d2ec7da8 R08: 0000000000000012 R09: 00000000fffffffe
        R10: 0000000000000000 R11: 0000000000000000 R12: ffff88009d633800
        R13: ffff8801d2ec7d48 R14: dead000000100100 R15: ffff88009d633a30
        FS:  00007f1748bb4700(0000) GS:ffff8801def80000(0000) knlGS:0000000000000000
        CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: 00007f4110300308 CR3: 00000000c05f7000 CR4: 0000000000002660
        Call Trace:
          pagefault_out_of_memory+0x18/0x90
          mm_fault_error+0xa9/0x1a0
          __do_page_fault+0x478/0x4c0
          do_page_fault+0x2c/0x40
          page_fault+0x28/0x30
        Code: 44 00 00 48 89 df e8 40 ca ff ff 48 85 c0 49 89 c4 74 35 4c 8b b0 30 02 00 00 4c 8d b8 30 02 00 00 4d 39 fe 74 1b 0f 1f 44 00 00 <49> 8b 7e 10 be 01 00 00 00 e8 42 d2 04 00 4d 8b 36 4d 39 fe 75
        RIP  mem_cgroup_oom_synchronize+0x140/0x240
      
      Commit fb2a6fc5 ("mm: memcg: rework and document OOM waiting and
      wakeup") has moved mem_cgroup_oom_notify outside of memcg_oom_lock
      assuming it is protected by the hierarchical OOM-lock.
      
      Although this is true for the notification part the protection doesn't
      cover unregistration of event which can happen in parallel now so
      mem_cgroup_oom_notify can see already unlinked and/or freed
      mem_cgroup_eventfd_list.
      
      Fix this by using memcg_oom_lock also in mem_cgroup_oom_notify.
      
      Addresses https://bugzilla.kernel.org/show_bug.cgi?id=80881
      
      Fixes: fb2a6fc5
      
       (mm: memcg: rework and document OOM waiting and wakeup)
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Reported-by: default avatarPaul Furtado <paulfurtado91@gmail.com>
      Tested-by: default avatarPaul Furtado <paulfurtado91@gmail.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>	[3.12+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2bcf2e92
  2. 06 Jun, 2014 3 commits
  3. 04 Jun, 2014 16 commits
    • Hugh Dickins's avatar
      mm, memcg: periodically schedule when emptying page list · 2a7a0e0f
      Hugh Dickins authored
      
      
      mem_cgroup_force_empty_list() can iterate a large number of pages on an
      lru and mem_cgroup_move_parent() doesn't return an errno unless certain
      criteria, none of which indicate that the iteration may be taking too
      long, is met.
      
      We have encountered the following stack trace many times indicating
      "need_resched set for > 51000020 ns (51 ticks) without schedule", for
      example:
      
      	scheduler_tick()
      	<timer irq>
      	mem_cgroup_move_account+0x4d/0x1d5
      	mem_cgroup_move_parent+0x8d/0x109
      	mem_cgroup_reparent_charges+0x149/0x2ba
      	mem_cgroup_css_offline+0xeb/0x11b
      	cgroup_offline_fn+0x68/0x16b
      	process_one_work+0x129/0x350
      
      If this iteration is taking too long, we still need to do cond_resched()
      even when an individual page is not busy.
      
      [rientjes@google.com: changelog]
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2a7a0e0f
    • Vladimir Davydov's avatar
      memcg: cleanup kmem cache creation/destruction functions naming · 776ed0f0
      Vladimir Davydov authored
      
      
      Current names are rather inconsistent. Let's try to improve them.
      
      Brief change log:
      
      ** old name **                          ** new name **
      
      kmem_cache_create_memcg                 memcg_create_kmem_cache
      memcg_kmem_create_cache                 memcg_regsiter_cache
      memcg_kmem_destroy_cache                memcg_unregister_cache
      
      kmem_cache_destroy_memcg_children       memcg_cleanup_cache_params
      mem_cgroup_destroy_all_caches           memcg_unregister_all_caches
      
      create_work                             memcg_register_cache_work
      memcg_create_cache_work_func            memcg_register_cache_func
      memcg_create_cache_enqueue              memcg_schedule_register_cache
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      776ed0f0
    • Vladimir Davydov's avatar
      memcg: memcg_kmem_create_cache: make memcg_name_buf statically allocated · 93f39eea
      Vladimir Davydov authored
      
      
      It isn't worth complicating the code by allocating it on the first access,
      because it only takes 256 bytes.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      93f39eea
    • Vladimir Davydov's avatar
      memcg: get rid of memcg_create_cache_name · 073ee1c6
      Vladimir Davydov authored
      
      
      Instead of calling back to memcontrol.c from kmem_cache_create_memcg in
      order to just create the name of a per memcg cache, let's allocate it in
      place.  We only need to pass the memcg name to kmem_cache_create_memcg for
      that - everything else can be done in slab_common.c.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      073ee1c6
    • Qiang Huang's avatar
      memcg: correct comments for __mem_cgroup_begin_update_page_stat · b5ffc856
      Qiang Huang authored
      
      Signed-off-by: default avatarQiang Huang <h.huangqiang@huawei.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b5ffc856
    • Qiang Huang's avatar
      memcg: fold mem_cgroup_stolen · bdcbb659
      Qiang Huang authored
      
      
      It is only used in __mem_cgroup_begin_update_page_stat(), the name is
      confusing and 2 routines for one thing also confuse people, so fold this
      function seems more clear.
      
      [akpm@linux-foundation.org: fix typo, per Michal]
      Signed-off-by: default avatarQiang Huang <h.huangqiang@huawei.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bdcbb659
    • Fabian Frederick's avatar
      mm/memcontrol.c: remove NULL assignment on static · ada4ba59
      Fabian Frederick authored
      
      
      static values are automatically initialized to NULL
      Signed-off-by: default avatarFabian Frederick <fabf@skynet.be>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ada4ba59
    • Christoph Lameter's avatar
      mm: replace __get_cpu_var uses with this_cpu_ptr · 7c8e0181
      Christoph Lameter authored
      
      
      Replace places where __get_cpu_var() is used for an address calculation
      with this_cpu_ptr().
      Signed-off-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7c8e0181
    • Vladimir Davydov's avatar
      memcg, slab: simplify synchronization scheme · bd673145
      Vladimir Davydov authored
      
      
      At present, we have the following mutexes protecting data related to per
      memcg kmem caches:
      
       - slab_mutex.  This one is held during the whole kmem cache creation
         and destruction paths.  We also take it when updating per root cache
         memcg_caches arrays (see memcg_update_all_caches).  As a result, taking
         it guarantees there will be no changes to any kmem cache (including per
         memcg).  Why do we need something else then?  The point is it is
         private to slab implementation and has some internal dependencies with
         other mutexes (get_online_cpus).  So we just don't want to rely upon it
         and prefer to introduce additional mutexes instead.
      
       - activate_kmem_mutex.  Initially it was added to synchronize
         initializing kmem limit (memcg_activate_kmem).  However, since we can
         grow per root cache memcg_caches arrays only on kmem limit
         initialization (see memcg_update_all_caches), we also employ it to
         protect against memcg_caches arrays relocation (e.g.  see
         __kmem_cache_destroy_memcg_children).
      
       - We have a convention not to take slab_mutex in memcontrol.c, but we
         want to walk over per memcg memcg_slab_caches lists there (e.g.  for
         destroying all memcg caches on offline).  So we have per memcg
         slab_caches_mutex's protecting those lists.
      
      The mutexes are taken in the following order:
      
         activate_kmem_mutex -> slab_mutex -> memcg::slab_caches_mutex
      
      Such a syncrhonization scheme has a number of flaws, for instance:
      
       - We can't call kmem_cache_{destroy,shrink} while walking over a
         memcg::memcg_slab_caches list due to locking order.  As a result, in
         mem_cgroup_destroy_all_caches we schedule the
         memcg_cache_params::destroy work shrinking and destroying the cache.
      
       - We don't have a mutex to synchronize per memcg caches destruction
         between memcg offline (mem_cgroup_destroy_all_caches) and root cache
         destruction (__kmem_cache_destroy_memcg_children).  Currently we just
         don't bother about it.
      
      This patch simplifies it by substituting per memcg slab_caches_mutex's
      with the global memcg_slab_mutex.  It will be held whenever a new per
      memcg cache is created or destroyed, so it protects per root cache
      memcg_caches arrays and per memcg memcg_slab_caches lists.  The locking
      order is following:
      
         activate_kmem_mutex -> memcg_slab_mutex -> slab_mutex
      
      This allows us to call kmem_cache_{create,shrink,destroy} under the
      memcg_slab_mutex.  As a result, we don't need memcg_cache_params::destroy
      work any more - we can simply destroy caches while iterating over a per
      memcg slab caches list.
      
      Also using the global mutex simplifies synchronization between concurrent
      per memcg caches creation/destruction, e.g.  mem_cgroup_destroy_all_caches
      vs __kmem_cache_destroy_memcg_children.
      
      The downside of this is that we substitute per-memcg slab_caches_mutex's
      with a hummer-like global mutex, but since we already take either the
      slab_mutex or the cgroup_mutex along with a memcg::slab_caches_mutex, it
      shouldn't hurt concurrency a lot.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bd673145
    • Vladimir Davydov's avatar
      memcg, slab: merge memcg_{bind,release}_pages to memcg_{un}charge_slab · c67a8a68
      Vladimir Davydov authored
      
      
      Currently we have two pairs of kmemcg-related functions that are called on
      slab alloc/free.  The first is memcg_{bind,release}_pages that count the
      total number of pages allocated on a kmem cache.  The second is
      memcg_{un}charge_slab that {un}charge slab pages to kmemcg resource
      counter.  Let's just merge them to keep the code clean.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c67a8a68
    • Vladimir Davydov's avatar
      memcg, slab: do not schedule cache destruction when last page goes away · 1e32e77f
      Vladimir Davydov authored
      This patchset is a part of preparations for kmemcg re-parenting.  It
      targets at simplifying kmemcg work-flows and synchronization.
      
      First, it removes async per memcg cache destruction (see patches 1, 2).
      Now caches are only destroyed on memcg offline.  That means the caches
      that are not empty on memcg offline will be leaked.  However, they are
      already leaked, because memcg_cache_params::nr_pages normally never drops
      to 0 so the destruction work is never scheduled except kmem_cache_shrink
      is called explicitly.  In the future I'm planning reaping such dead caches
      on vmpressure or periodically.
      
      Second, it substitutes per memcg slab_caches_mutex's with the global
      memcg_slab_mutex, which should be taken during the whole per memcg cache
      creation/destruction path before the slab_mutex (see patch 3).  This
      greatly simplifies synchronization among various per memcg cache
      creation/destruction paths.
      
      I'm still not quite sure about the end picture, in particular I don't know
      whether we should reap dead memcgs' kmem caches periodically or try to
      merge them with their parents (see https://lkml.org/lkml/2014/4/20/38
      
       for
      more details), but whichever way we choose, this set looks like a
      reasonable change to me, because it greatly simplifies kmemcg work-flows
      and eases further development.
      
      This patch (of 3):
      
      After a memcg is offlined, we mark its kmem caches that cannot be deleted
      right now due to pending objects as dead by setting the
      memcg_cache_params::dead flag, so that memcg_release_pages will schedule
      cache destruction (memcg_cache_params::destroy) as soon as the last slab
      of the cache is freed (memcg_cache_params::nr_pages drops to zero).
      
      I guess the idea was to destroy the caches as soon as possible, i.e.
      immediately after freeing the last object.  However, it just doesn't work
      that way, because kmem caches always preserve some pages for the sake of
      performance, so that nr_pages never gets to zero unless the cache is
      shrunk explicitly using kmem_cache_shrink.  Of course, we could account
      the total number of objects on the cache or check if all the slabs
      allocated for the cache are empty on kmem_cache_free and schedule
      destruction if so, but that would be too costly.
      
      Thus we have a piece of code that works only when we explicitly call
      kmem_cache_shrink, but complicates the whole picture a lot.  Moreover,
      it's racy in fact.  For instance, kmem_cache_shrink may free the last slab
      and thus schedule cache destruction before it finishes checking that the
      cache is empty, which can lead to use-after-free.
      
      So I propose to remove this async cache destruction from
      memcg_release_pages, and check if the cache is empty explicitly after
      calling kmem_cache_shrink instead.  This will simplify things a lot w/o
      introducing any functional changes.
      
      And regarding dead memcg caches (i.e.  those that are left hanging around
      after memcg offline for they have objects), I suppose we should reap them
      either periodically or on vmpressure as Glauber suggested initially.  I'm
      going to implement this later.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1e32e77f
    • Michal Hocko's avatar
      memcg: do not hang on OOM when killed by userspace OOM access to memory reserves · d8dc595c
      Michal Hocko authored
      Eric has reported that he can see task(s) stuck in memcg OOM handler
      regularly.  The only way out is to
      
      	echo 0 > $GROUP/memory.oom_control
      
      His usecase is:
      
      - Setup a hierarchy with memory and the freezer (disable kernel oom and
        have a process watch for oom).
      
      - In that memory cgroup add a process with one thread per cpu.
      
      - In one thread slowly allocate once per second I think it is 16M of ram
        and mlock and dirty it (just to force the pages into ram and stay
        there).
      
      - When oom is achieved loop:
        * attempt to freeze all of the tasks.
        * if frozen send every task SIGKILL, unfreeze, remove the directory in
          cgroupfs.
      
      Eric has then pinpointed the issue to be memcg specific.
      
      All tasks are sitting on the memcg_oom_waitq when memcg oom is disabled.
      Those that have received fatal signal will bypass the charge and should
      continue on their way out.  The tricky part is that the exit path might
      trigger a page fault (e.g.  exit_robust_list), thus the memcg charge,
      while its memcg is still under OOM because nobody has released any charges
      yet.
      
      Unlike with the in-kernel OOM handler the exiting task doesn't get
      TIF_MEMDIE set so it doesn't shortcut further charges of the killed task
      and falls to the memcg OOM again without any way out of it as there are no
      fatal signals pending anymore.
      
      This patch fixes the issue by checking PF_EXITING early in
      mem_cgroup_try_charge and bypass the charge same as if it had fatal
      signal pending or TIF_MEMDIE set.
      
      Normally exiting tasks (aka not killed) will bypass the charge now but
      this should be OK as the task is leaving and will release memory and
      increasing the memory pressure just to release it in a moment seems
      dubious wasting of cycles.  Besides that charges after exit_signals should
      be rare.
      
      I am bringing this patch again (rebased on the current mmotm tree). I
      hope we can move forward finally. If there is still an opposition then
      I would really appreciate a concurrent approach so that we can discuss
      alternatives.
      
      http://comments.gmane.org/gmane.linux.kernel.stable/77650
      
       is a reference
      to the followup discussion when the patch has been dropped from the mmotm
      last time.
      Reported-by: default avatarEric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d8dc595c
    • Vladimir Davydov's avatar
      memcg: un-export __memcg_kmem_get_cache · e8d9df3a
      Vladimir Davydov authored
      
      
      It is only used in slab and should not be used anywhere else so there is
      no need in exporting it.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e8d9df3a
    • Johannes Weiner's avatar
      mm: memcontrol: remove hierarchy restrictions for swappiness and oom_control · 3dae7fec
      Johannes Weiner authored
      
      
      Per-memcg swappiness and oom killing can currently not be tweaked on a
      memcg that is part of a hierarchy, but not the root of that hierarchy.
      Users have complained that they can't configure this when they turned on
      hierarchy mode.  In fact, with hierarchy mode becoming the default, this
      restriction disables the tunables entirely.
      
      But there is no good reason for this restriction.  The settings for
      swappiness and OOM killing are taken from whatever memcg whose limit
      triggered reclaim and OOM invocation, regardless of its position in the
      hierarchy tree.
      
      Allow setting swappiness on any group.  The knob on the root memcg
      already reads the global VM swappiness, make it writable as well.
      
      Allow disabling the OOM killer on any non-root memcg.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3dae7fec
    • Vladimir Davydov's avatar
      mm: get rid of __GFP_KMEMCG · 52383431
      Vladimir Davydov authored
      
      
      Currently to allocate a page that should be charged to kmemcg (e.g.
      threadinfo), we pass __GFP_KMEMCG flag to the page allocator.  The page
      allocated is then to be freed by free_memcg_kmem_pages.  Apart from
      looking asymmetrical, this also requires intrusion to the general
      allocation path.  So let's introduce separate functions that will
      alloc/free pages charged to kmemcg.
      
      The new functions are called alloc_kmem_pages and free_kmem_pages.  They
      should be used when the caller actually would like to use kmalloc, but
      has to fall back to the page allocator for the allocation is large.
      They only differ from alloc_pages and free_pages in that besides
      allocating or freeing pages they also charge them to the kmem resource
      counter of the current memory cgroup.
      
      [sfr@canb.auug.org.au: export kmalloc_order() to modules]
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Acked-by: default avatarGreg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: default avatarStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      52383431
    • Vladimir Davydov's avatar
      sl[au]b: charge slabs to kmemcg explicitly · 5dfb4175
      Vladimir Davydov authored
      
      
      We have only a few places where we actually want to charge kmem so
      instead of intruding into the general page allocation path with
      __GFP_KMEMCG it's better to explictly charge kmem there.  All kmem
      charges will be easier to follow that way.
      
      This is a step towards removing __GFP_KMEMCG.  It removes __GFP_KMEMCG
      from memcg caches' allocflags.  Instead it makes slab allocation path
      call memcg_charge_kmem directly getting memcg to charge from the cache's
      memcg params.
      
      This also eliminates any possibility of misaccounting an allocation
      going from one memcg's cache to another memcg, because now we always
      charge slabs against the memcg the cache belongs to.  That's why this
      patch removes the big comment to memcg_kmem_get_cache.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Acked-by: default avatarGreg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5dfb4175
  4. 23 May, 2014 1 commit
    • Michal Hocko's avatar
      memcg: fix swapcache charge from kernel thread context · 6f6acb00
      Michal Hocko authored
      Commit 284f39af ("mm: memcg: push !mm handling out to page cache
      charge function") explicitly checks for page cache charges without any
      mm context (from kernel thread context[1]).
      
      This seemed to be the only possible case where memory could be charged
      without mm context so commit 03583f1a ("memcg: remove unnecessary
      !mm check from try_get_mem_cgroup_from_mm()") removed the mm check from
      get_mem_cgroup_from_mm().  This however caused another NULL ptr
      dereference during early boot when loopback kernel thread splices to
      tmpfs as reported by Stephan Kulow:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000360
        IP: get_mem_cgroup_from_mm.isra.42+0x2b/0x60
        Oops: 0000 [#1] SMP
        Modules linked in: btrfs dm_multipath dm_mod scsi_dh multipath raid10 raid456 async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx raid1 raid0 md_mod parport_pc parport nls_utf8 isofs usb_storage iscsi_ibft iscsi_boot_sysfs arc4 ecb fan thermal nfs lockd fscache nls_iso8859_1 nls_cp437 sg st hid_generic usbhid af_packet sunrpc sr_mod cdrom ata_generic uhci_hcd virtio_net virtio_blk ehci_hcd usbcore ata_piix floppy processor button usb_common virtio_pci virtio_ring virtio edd squashfs loop ppa]
        CPU: 0 PID: 97 Comm: loop1 Not tainted 3.15.0-rc5-5-default #1
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        Call Trace:
          __mem_cgroup_try_charge_swapin+0x40/0xe0
          mem_cgroup_charge_file+0x8b/0xd0
          shmem_getpage_gfp+0x66b/0x7b0
          shmem_file_splice_read+0x18f/0x430
          splice_direct_to_actor+0xa2/0x1c0
          do_lo_receive+0x5a/0x60 [loop]
          loop_thread+0x298/0x720 [loop]
          kthread+0xc6/0xe0
          ret_from_fork+0x7c/0xb0
      
      Also Branimir Maksimovic reported the following oops which is tiggered
      for the swapcache charge path from the accounting code for kernel threads:
      
        CPU: 1 PID: 160 Comm: kworker/u8:5 Tainted: P           OE 3.15.0-rc5-core2-custom #159
        Hardware name: System manufacturer System Product Name/MAXIMUSV GENE, BIOS 1903 08/19/2013
        task: ffff880404e349b0 ti: ffff88040486a000 task.ti: ffff88040486a000
        RIP: get_mem_cgroup_from_mm.isra.42+0x2b/0x60
        Call Trace:
          __mem_cgroup_try_charge_swapin+0x45/0xf0
          mem_cgroup_charge_file+0x9c/0xe0
          shmem_getpage_gfp+0x62c/0x770
          shmem_write_begin+0x38/0x40
          generic_perform_write+0xc5/0x1c0
          __generic_file_aio_write+0x1d1/0x3f0
          generic_file_aio_write+0x4f/0xc0
          do_sync_write+0x5a/0x90
          do_acct_process+0x4b1/0x550
          acct_process+0x6d/0xa0
          do_exit+0x827/0xa70
          kthread+0xc3/0xf0
      
      This patch fixes the issue by reintroducing mm check into
      get_mem_cgroup_from_mm.  We could do the same trick in
      __mem_cgroup_try_charge_swapin as we do for the regular page cache path
      but it is not worth troubles.  The check is not that expensive and it is
      better to have get_mem_cgroup_from_mm more robust.
      
      [1] - http://marc.info/?l=linux-mm&m=139463617808941&w=2
      
      Fixes: 03583f1a
      
       ("memcg: remove unnecessary !mm check from try_get_mem_cgroup_from_mm()")
      Reported-and-tested-by: default avatarStephan Kulow <coolo@suse.com>
      Reported-by: default avatarBranimir Maksimovic <branimir.maksimovic@gmail.com>
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6f6acb00
  5. 16 May, 2014 3 commits
    • Tejun Heo's avatar
      memcg: update memcg_has_children() to use css_next_child() · ea280e7b
      Tejun Heo authored
      
      
      Currently, memcg_has_children() and mem_cgroup_hierarchy_write()
      directly test cgroup->children for list emptiness.  It's semantically
      correct in traditional hierarchies as it actually wants to test for
      any children dead or alive; however, cgroup->children is not a
      published field and scheduled to go away.
      
      This patch moves out .use_hierarchy test out of memcg_has_children()
      and updates it to use css_next_child() to test whether there exists
      any children.  With .use_hierarchy test moved out, it can also be used
      by mem_cgroup_hierarchy_write().
      
      A side note: As .use_hierarchy is going away, it doesn't really matter
      but I'm not sure about how it's used in __memcg_activate_kmem().  The
      condition tested by memcg_has_children() is mushy when seen from
      userland as its result is affected by dead csses which aren't visible
      from userland.  I think the rule would be a lot clearer if we have a
      dedicated "freshly minted" flag which gets cleared when the first task
      is migrated into it or the first child is created and then gate
      activation with that.
      
      v2: Added comment noting that testing use_hierarchy is the
          responsibility of the callers of memcg_has_children() as suggested
          by Michal Hocko.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      ea280e7b
    • Michal Hocko's avatar
      memcg: remove tasks/children test from mem_cgroup_force_empty() · f61c42a7
      Michal Hocko authored
      
      
      Tejun has correctly pointed out that tasks/children test in
      mem_cgroup_force_empty is not correct because there is no other locking
      which preserves this state throughout the rest of the function so both
      new tasks can join the group or new children groups can be added while
      somebody is writing to memory.force_empty. A new task would break
      mem_cgroup_reparent_charges expectation that all failures as described
      by mem_cgroup_force_empty_list are temporal and there is no way out.
      
      The main use case for the knob as described by
      Documentation/cgroups/memory.txt is to:
      "
        The typical use case for this interface is before calling rmdir().
        Because rmdir() moves all pages to parent, some out-of-use page caches can be
        moved to the parent. If you want to avoid that, force_empty will be useful.
      "
      
      This means that reparenting is not really required as rmdir will
      reparent pages implicitly from the safe context. If we remove it from
      mem_cgroup_force_empty then we are safe even with existing tasks because
      the number of reclaim attempts is bounded. Moreover the knob still does
      what the documentation claims (modulo reparenting which doesn't make any
      difference) and users might expect. Longterm we want to deprecate the
      whole knob and put the reparented pages to the tail of parent LRU during
      cgroup removal.
      
      tj: Removed unused variable @cgrp from mem_cgroup_force_empty()
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      f61c42a7
    • Tejun Heo's avatar
      cgroup: remove css_parent() · 5c9d535b
      Tejun Heo authored
      
      
      cgroup in general is moving towards using cgroup_subsys_state as the
      fundamental structural component and css_parent() was introduced to
      convert from using cgroup->parent to css->parent.  It was quite some
      time ago and we're moving forward with making css more prominent.
      
      This patch drops the trivial wrapper css_parent() and let the users
      dereference css->parent.  While at it, explicitly mark fields of css
      which are public and immutable.
      
      v2: New usage from device_cgroup.c converted.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Acked-by: default avatarNeil Horman <nhorman@tuxdriver.com>
      Acked-by: default avatar"David S. Miller" <davem@davemloft.net>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      5c9d535b
  6. 13 May, 2014 3 commits
    • Tejun Heo's avatar
      cgroup: replace cftype->trigger() with cftype->write() · 6770c64e
      Tejun Heo authored
      
      
      cftype->trigger() is pointless.  It's trivial to ignore the input
      buffer from a regular ->write() operation.  Convert all ->trigger()
      users to ->write() and remove ->trigger().
      
      This patch doesn't introduce any visible behavior changes.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      6770c64e
    • Tejun Heo's avatar
      cgroup: replace cftype->write_string() with cftype->write() · 451af504
      Tejun Heo authored
      
      
      Convert all cftype->write_string() users to the new cftype->write()
      which maps directly to kernfs write operation and has full access to
      kernfs and cgroup contexts.  The conversions are mostly mechanical.
      
      * @css and @cft are accessed using of_css() and of_cft() accessors
        respectively instead of being specified as arguments.
      
      * Should return @nbytes on success instead of 0.
      
      * @buf is not trimmed automatically.  Trim if necessary.  Note that
        blkcg and netprio don't need this as the parsers already handle
        whitespaces.
      
      cftype->write_string() has no user left after the conversions and
      removed.
      
      While at it, remove unnecessary local variable @p in
      cgroup_subtree_control_write() and stale comment about
      CGROUP_LOCAL_BUFFER_SIZE in cgroup_freezer.c.
      
      This patch doesn't introduce any visible behavior changes.
      
      v2: netprio was missing from conversion.  Converted.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarAristeu Rozanski <arozansk@redhat.com>
      Acked-by: default avatarVivek Goyal <vgoyal@redhat.com>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      451af504
    • Tejun Heo's avatar
      cgroup: rename css_tryget*() to css_tryget_online*() · ec903c0c
      Tejun Heo authored
      
      
      Unlike the more usual refcnting, what css_tryget() provides is the
      distinction between online and offline csses instead of protection
      against upping a refcnt which already reached zero.  cgroup is
      planning to provide actual tryget which fails if the refcnt already
      reached zero.  Let's rename the existing trygets so that they clearly
      indicate that they're onliness.
      
      I thought about keeping the existing names as-are and introducing new
      names for the planned actual tryget; however, given that each
      controller participates in the synchronization of the online state, it
      seems worthwhile to make it explicit that these functions are about
      on/offline state.
      
      Rename css_tryget() to css_tryget_online() and css_tryget_from_dir()
      to css_tryget_online_from_dir().  This is pure rename.
      
      v2: cgroup_freezer grew new usages of css_tryget().  Update
          accordingly.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      ec903c0c
  7. 06 May, 2014 1 commit
    • Johannes Weiner's avatar
      mm: filemap: update find_get_pages_tag() to deal with shadow entries · 139b6a6f
      Johannes Weiner authored
      Dave Jones reports the following crash when find_get_pages_tag() runs
      into an exceptional entry:
      
        kernel BUG at mm/filemap.c:1347!
        RIP: find_get_pages_tag+0x1cb/0x220
        Call Trace:
          find_get_pages_tag+0x36/0x220
          pagevec_lookup_tag+0x21/0x30
          filemap_fdatawait_range+0xbe/0x1e0
          filemap_fdatawait+0x27/0x30
          sync_inodes_sb+0x204/0x2a0
          sync_inodes_one_sb+0x19/0x20
          iterate_supers+0xb2/0x110
          sys_sync+0x44/0xb0
          ia32_do_call+0x13/0x13
      
        1343                         /*
        1344                          * This function is never used on a shmem/tmpfs
        1345                          * mapping, so a swap entry won't be found here.
        1346                          */
        1347                         BUG();
      
      After commit 0cd6144a ("mm + fs: prepare for non-page entries in
      page cache radix trees") this comment and BUG() are out of date because
      exceptional entries can now appear in all mappings - as shadows of
      recently evicted pages.
      
      However, as Hugh Dickins notes,
      
        "it is truly surprising for a PAGECACHE_TAG_WRITEBACK (and probably
         any other PAGECACHE_TAG_*) to appear on an exceptional entry.
      
         I expect it comes down to an occasional race in RCU lookup of the
         radix_tree: lacking absolute synchronization, we might sometimes
         catch an exceptional entry, with the tag which really belongs with
         the unexceptional entry which was there an instant before."
      
      And indeed, not only is the tree walk lockless, the tags are also read
      in chunks, one radix tree node at a time.  There is plenty of time for
      page reclaim to swoop in and replace a page that was already looked up
      as tagged with a shadow entry.
      
      Remove the BUG() and update the comment.  While reviewing all other
      lookup sites for whether they properly deal with shadow entries of
      evicted pages, update all the comments and fix memcg file charge moving
      to not miss shmem/tmpfs swapcache pages.
      
      Fixes: 0cd6144a
      
       ("mm + fs: prepare for non-page entries in page cache radix trees")
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: default avatarDave Jones <davej@redhat.com>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      139b6a6f
  8. 04 May, 2014 2 commits
    • Tejun Heo's avatar
      cgroup, memcg: implement css->id and convert css_from_id() to use it · 15a4c835
      Tejun Heo authored
      
      
      Until now, cgroup->id has been used to identify all the associated
      csses and css_from_id() takes cgroup ID and returns the matching css
      by looking up the cgroup and then dereferencing the css associated
      with it; however, now that the lifetimes of cgroup and css are
      separate, this is incorrect and breaks on the unified hierarchy when a
      controller is disabled and enabled back again before the previous
      instance is released.
      
      This patch adds css->id which is a subsystem-unique ID and converts
      css_from_id() to look up by the new css->id instead.  memcg is the
      only user of css_from_id() and also converted to use css->id instead.
      
      For traditional hierarchies, this shouldn't make any functional
      difference.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jianyu Zhan <nasa4836@gmail.com>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      15a4c835
    • Tejun Heo's avatar
      cgroup, memcg: allocate cgroup ID from 1 · 7d699ddb
      Tejun Heo authored
      
      
      Currently, cgroup->id is allocated from 0, which is always assigned to
      the root cgroup; unfortunately, memcg wants to use ID 0 to indicate
      invalid IDs and ends up incrementing all IDs by one.
      
      It's reasonable to reserve 0 for special purposes.  This patch updates
      cgroup core so that ID 0 is not used and the root cgroups get ID 1.
      The ID incrementing is removed form memcg.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      7d699ddb
  9. 07 Apr, 2014 10 commits