    mm, oom: do not rely on TIF_MEMDIE for memory reserves access · cd04ae1e
    Michal Hocko authored
    For ages we have been relying on TIF_MEMDIE thread flag to mark OOM
    victims and then, among other things, to give these threads full access
    to memory reserves.  There are a few shortcomings of this implementation,
    though.
    
    First and most seriously, full access to memory reserves is quite
    dangerous because it leaves no safety margin for the system to operate
    and potentially take last emergency steps to move on.
    
    Secondly this flag is per task_struct while the OOM killer operates on
    mm_struct granularity so all processes sharing the given mm are killed.
    Giving the full access to all these task_structs could lead to a quick
    memory reserves depletion.  We have tried to reduce this risk by giving
    TIF_MEMDIE only to the main thread and the currently allocating task but
    that doesn't really solve the problem, and it surely opens up room
    for corner cases - e.g.  GFP_NO{FS,IO} requests might loop inside the
    allocator without access to memory reserves because a particular thread
    was not the group leader.
    
    Now that we have the oom reaper and that all oom victims are reapable
    after 1b51e65e ("oom, oom_reaper: allow to reap mm shared by the
    kthreads") we can be more conservative and grant only partial access to
    memory reserves because there is a reasonable chance that memory is being
    freed in parallel.  We still want some access to reserves because we do not
    want other consumers to eat up the victim's freed memory.  oom victims
    will still contend with __GFP_HIGH users but those shouldn't be so
    aggressive as to starve oom victims completely.
    
    Introduce an ALLOC_OOM flag and give all tsk_is_oom_victim tasks access
    to half of the reserves.  This makes the access to reserves independent
    of which task has passed through mark_oom_victim.  Also drop any usage
    of TIF_MEMDIE from the page allocator proper and replace it with
    tsk_is_oom_victim, which finally makes page_alloc.c completely
    TIF_MEMDIE free.
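
    The two ideas above (a per-process victim check instead of a per-thread
    flag, and access to half of the reserves) can be illustrated with a small
    user-space model.  This is only a sketch under stated assumptions: every
    sk_* name and SK_* flag value is hypothetical, the structures merely stand
    in for the kernel's, and interactions with other allocation flags are
    left out.

```c
#include <stdbool.h>

/* Hypothetical flag values for this sketch; not the kernel's definitions. */
#define SK_ALLOC_NO_WATERMARKS 0x4u /* full access to reserves */
#define SK_ALLOC_OOM           0x8u /* partial access for OOM victims */

/* State shared by every thread of a process (assumed model). */
struct sk_process {
    bool oom_victim; /* set once when the OOM killer selects the process */
};

/* One thread; all threads of a process point at the same sk_process. */
struct sk_task {
    struct sk_process *proc;
};

/* Per-process check: true for any thread of a victim, not only the
 * thread that happened to be marked. */
static bool sk_task_is_oom_victim(const struct sk_task *t)
{
    return t->proc->oom_victim;
}

/* Allocation flags a task gets purely from its victim status. */
static unsigned int sk_alloc_flags(const struct sk_task *t)
{
    return sk_task_is_oom_victim(t) ? SK_ALLOC_OOM : 0u;
}

/* Lowest free-page level the caller may allocate down to. */
static long sk_effective_min(long min_wmark, unsigned int flags)
{
    if (flags & SK_ALLOC_NO_WATERMARKS)
        return 0;                          /* old behaviour: no floor at all */
    if (flags & SK_ALLOC_OOM)
        return min_wmark - min_wmark / 2;  /* half of the reserves stays untouched */
    return min_wmark;                      /* ordinary callers respect the full mark */
}

/* True if an allocation may proceed with the given number of free pages. */
static bool sk_watermark_ok(long free_pages, long min_wmark, unsigned int flags)
{
    return free_pages > sk_effective_min(min_wmark, flags);
}
```

    In this model, marking the shared sk_process once makes every thread
    that shares it eligible, and none of them can push free memory below half
    of the minimum watermark.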
    
    CONFIG_MMU=n has no oom reaper, so let's stick to the original
    ALLOC_NO_WATERMARKS approach there.
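
    One way to express such a configuration-dependent fallback is to alias
    the victim flag to the no-watermarks flag at compile time.  The sketch
    below is an assumption about how this could look, not a quote of the
    patch: DEMO_HAVE_MMU, the DEMO_* flag values and demo_reserve_floor()
    are hypothetical names used for illustration only.

```c
/* Hypothetical build switch standing in for the MMU configuration. */
#ifndef DEMO_HAVE_MMU
#define DEMO_HAVE_MMU 1
#endif

/* Hypothetical flag values; not the kernel's definitions. */
#define DEMO_ALLOC_NO_WATERMARKS 0x04u

#if DEMO_HAVE_MMU
/* With a reaper available, victims get their own, more conservative flag. */
#define DEMO_ALLOC_OOM 0x08u
#else
/* Without a reaper, victims fall back to full access to the reserves. */
#define DEMO_ALLOC_OOM DEMO_ALLOC_NO_WATERMARKS
#endif

/* Lowest free-page level a caller with these flags may allocate down to. */
static long demo_reserve_floor(long min_wmark, unsigned int flags)
{
    if (flags & DEMO_ALLOC_NO_WATERMARKS)
        return 0;                          /* also taken by victims when the flags alias */
    if (flags & DEMO_ALLOC_OOM)
        return min_wmark - min_wmark / 2;  /* victims keep half of the reserves intact */
    return min_wmark;
}
```

    Building with -DDEMO_HAVE_MMU=0 makes the first branch apply to victims,
    so callers do not need to know which configuration is in effect.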
    
    There is demand to make the oom killer memcg aware, which implies
    many tasks being killed at once.  This change allows such a use case
    without worrying about complete depletion of the memory reserves.
    
    Link: http://lkml.kernel.org/r/20170810075019.28998-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Roman Gushchin <guro@fb.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
internal.h 16.6 KB