page_alloc.c

      page_frag: Recover from memory pressure · d8c19014
      Dongli Zhang authored
      The ethernet driver may allocate an skb (and skb->data) via napi_alloc_skb().
      This ends up in page_frag_alloc(), which allocates skb->data from
      page_frag_cache->va.
      
      Under memory pressure, page_frag_cache->va may be allocated as a
      pfmemalloc page. As a result, skb->pfmemalloc is always true, because
      skb->data comes from page_frag_cache->va. The skb will be dropped if the
      sock (receiver) does not have SOCK_MEMALLOC. This is expected behaviour
      under memory pressure.
      
      However, once the kernel is no longer under memory pressure (suppose a
      large number of pages have just been reclaimed), page_frag_alloc() may still
      re-use the prior pfmemalloc page_frag_cache->va to allocate skb->data. As a
      result, skb->pfmemalloc stays true until page_frag_cache->va is
      re-allocated, even though the kernel is no longer under memory pressure.
      
      Here is how the kernel runs into the issue.
      
      1. The kernel is under memory pressure and the allocation of
      PAGE_FRAG_CACHE_MAX_ORDER in __page_frag_cache_refill() fails. Instead,
      a pfmemalloc page is allocated for page_frag_cache->va.
      
      2. Every skb whose skb->data comes from page_frag_cache->va (pfmemalloc)
      has skb->pfmemalloc=true. Such an skb is always dropped by a sock without
      SOCK_MEMALLOC. This is expected behaviour.
      
      3. Suppose a large number of pages are reclaimed and the kernel is no longer
      under memory pressure. We expect the skb->pfmemalloc drops to stop.
      
      4. Unfortunately, page_frag_alloc() does not proactively re-allocate
      page_frag_cache->va and always re-uses the prior pfmemalloc page.
      skb->pfmemalloc stays true even though the kernel is no longer under
      memory pressure.
      
      Fix this by freeing and re-allocating the page instead of recycling it.
      
      References: https://lore.kernel.org/lkml/20201103193239.1807-1-dongli.zhang@oracle.com/
      References: https://lore.kernel.org/linux-mm/20201105042140.5253-1-willy@infradead.org/
      
      
      Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
      Cc: Bert Barbe <bert.barbe@oracle.com>
      Cc: Rama Nichanamatlu <rama.nichanamatlu@oracle.com>
      Cc: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
      Cc: Manjunath Patil <manjunath.b.patil@oracle.com>
      Cc: Joe Jin <joe.jin@oracle.com>
      Cc: SRINIVAS <srinivas.eeda@oracle.com>
      Fixes: 79930f58 ("net: do not deplete pfmemalloc reserve")
      Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20201115201029.11903-1-dongli.zhang@oracle.com
      
      
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
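
The four steps in the commit message can be illustrated with a small user-space model. This is only a sketch, not the kernel code: `model_frag_cache`, `model_refill()`, `model_frag_alloc()`, the `recover` switch and the `model_under_pressure` flag are all hypothetical names invented for illustration, `malloc()` stands in for the page allocator, and the model assumes every previously handed-out fragment has already been released by the time the page is exhausted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_PAGE_SIZE 4096u

/* Hypothetical stand-in for "the system is under memory pressure". */
static bool model_under_pressure;

struct model_frag_cache {
	unsigned char *va;  /* backing page, NULL until the first refill */
	size_t remaining;   /* bytes still available in the page */
	bool pfmemalloc;    /* page was taken from the emergency reserves */
};

/* Get a fresh page and record whether it came from the reserves. */
static bool model_refill(struct model_frag_cache *nc)
{
	nc->va = malloc(MODEL_PAGE_SIZE);
	if (!nc->va)
		return false;
	nc->remaining = MODEL_PAGE_SIZE;
	nc->pfmemalloc = model_under_pressure;
	return true;
}

/*
 * Hand out a fragment of @size bytes. *pfmemalloc receives the flag that a
 * consumer of the fragment (an skb in the kernel) would inherit.
 * @recover selects the fixed behaviour (true) or the old one (false).
 */
static void *model_frag_alloc(struct model_frag_cache *nc, size_t size,
			      bool recover, bool *pfmemalloc)
{
	assert(nc != NULL && pfmemalloc != NULL);

	if (size == 0 || size > MODEL_PAGE_SIZE)
		return NULL;
	if (!nc->va && !model_refill(nc))
		return NULL;

	if (nc->remaining < size) {
		if (recover && nc->pfmemalloc) {
			/* Fixed: free the reserve page and allocate a new one. */
			free(nc->va);
			if (!model_refill(nc))
				return NULL;
		} else {
			/* Old: recycle the same page, so a stale flag survives. */
			nc->remaining = MODEL_PAGE_SIZE;
		}
	}

	nc->remaining -= size;
	*pfmemalloc = nc->pfmemalloc;
	return nc->va + nc->remaining;
}
```

With `recover` set to false the flag stays true indefinitely once a reserve page has been cached, which corresponds to step 4. With `recover` set to true the flag clears the first time the page is exhausted after the pressure has gone.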
    bitops.h 9.92 KiB
    /* SPDX-License-Identifier: GPL-2.0 */
    #ifndef _ASM_X86_BITOPS_H
    #define _ASM_X86_BITOPS_H
    
    /*
     * Copyright 1992, Linus Torvalds.
     *
     * Note: inlines with more than a single statement should be marked
     * __always_inline to avoid problems with older gcc's inlining heuristics.
     */
    
    #ifndef _LINUX_BITOPS_H
    #error only <linux/bitops.h> can be included directly
    #endif
    
    #include <linux/compiler.h>
    #include <asm/alternative.h>
    #include <asm/rmwcc.h>
    #include <asm/barrier.h>
    
    #if BITS_PER_LONG == 32
    # define _BITOPS_LONG_SHIFT 5
    #elif BITS_PER_LONG == 64
    # define _BITOPS_LONG_SHIFT 6
    #else
    # error "Unexpected BITS_PER_LONG"
    #endif
    
    #define BIT_64(n)			(U64_C(1) << (n))
    
    /*
     * These have to be done with inline assembly: that way the bit-setting
     * is guaranteed to be atomic. All bit operations return 0 if the bit
     * was cleared before the operation and != 0 if it was not.
     *
     * bit 0 is the LSB of addr; bit 32 is the LSB of (addr+1).
     */
    
    #define RLONG_ADDR(x)			 "m" (*(volatile long *) (x))
    #define WBYTE_ADDR(x)			"+m" (*(volatile char *) (x))
    
    #define ADDR				RLONG_ADDR(addr)
    
    /*
     * We do the locked ops that don't return the old value as
     * a mask operation on a byte.
     */
    #define CONST_MASK_ADDR(nr, addr)	WBYTE_ADDR((void *)(addr) + ((nr)>>3))
    #define CONST_MASK(nr)			(1 << ((nr) & 7))
    
    static __always_inline void
    arch_set_bit(long nr, volatile unsigned long *addr)
    {
    	if (__builtin_constant_p(nr)) {
    		asm volatile(LOCK_PREFIX "orb %1,%0"
    			: CONST_MASK_ADDR(nr, addr)
    			: "iq" (CONST_MASK(nr) & 0xff)
    			: "memory");
    	} else {
    		asm volatile(LOCK_PREFIX __ASM_SIZE(bts) " %1,%0"
    			: : RLONG_ADDR(addr), "Ir" (nr) : "memory");
    	}
    }
    
    static __always_inline void
    arch___set_bit(long nr, volatile unsigned long *addr)
    {
    	asm volatile(__ASM_SIZE(bts) " %1,%0" : : ADDR, "Ir" (nr) : "memory");
    }
    
    static __always_inline void
    arch_clear_bit(long nr, volatile unsigned long *addr)
    {
    	if (__builtin_constant_p(nr)) {
    		asm volatile(LOCK_PREFIX "andb %1,%0"
    			: CONST_MASK_ADDR(nr, addr)
    			: "iq" (CONST_MASK(nr) ^ 0xff));
    	} else {
    		asm volatile(LOCK_PREFIX __ASM_SIZE(btr) " %1,%0"
    			: : RLONG_ADDR(addr), "Ir" (nr) : "memory");
    	}
    }
    
    static __always_inline void
    arch_clear_bit_unlock(long nr, volatile unsigned long *addr)
    {
    	barrier();
    	arch_clear_bit(nr, addr);
    }
    
    static __always_inline void
    arch___clear_bit(long nr, volatile unsigned long *addr)
    {
    	asm volatile(__ASM_SIZE(btr) " %1,%0" : : ADDR, "Ir" (nr) : "memory");
    }
    
    static __always_inline bool
    arch_clear_bit_unlock_is_negative_byte(long nr, volatile unsigned long *addr)
    {
    	bool negative;
    	asm volatile(LOCK_PREFIX "andb %2,%1"
    		CC_SET(s)
    		: CC_OUT(s) (negative), WBYTE_ADDR(addr)
    		: "ir" ((char) ~(1 << nr)) : "memory");
    	return negative;
    }
    #define arch_clear_bit_unlock_is_negative_byte                                 \
    	arch_clear_bit_unlock_is_negative_byte
    
    static __always_inline void
    arch___clear_bit_unlock(long nr, volatile unsigned long *addr)
    {
    	arch___clear_bit(nr, addr);
    }
    
    static __always_inline void
    arch___change_bit(long nr, volatile unsigned long *addr)
    {
    	asm volatile(__ASM_SIZE(btc) " %1,%0" : : ADDR, "Ir" (nr) : "memory");
    }
    
    static __always_inline void
    arch_change_bit(long nr, volatile unsigned long *addr)
    {
    	if (__builtin_constant_p(nr)) {
    		asm volatile(LOCK_PREFIX "xorb %1,%0"
    			: CONST_MASK_ADDR(nr, addr)
    			: "iq" ((u8)CONST_MASK(nr)));
    	} else {
    		asm volatile(LOCK_PREFIX __ASM_SIZE(btc) " %1,%0"
    			: : RLONG_ADDR(addr), "Ir" (nr) : "memory");
    	}
    }
    
    static __always_inline bool
    arch_test_and_set_bit(long nr, volatile unsigned long *addr)
    {
    	return GEN_BINARY_RMWcc(LOCK_PREFIX __ASM_SIZE(bts), *addr, c, "Ir", nr);
    }
    
    static __always_inline bool
    arch_test_and_set_bit_lock(long nr, volatile unsigned long *addr)
    {
    	return arch_test_and_set_bit(nr, addr);
    }
    
    static __always_inline bool
    arch___test_and_set_bit(long nr, volatile unsigned long *addr)
    {
    	bool oldbit;
    
    	asm(__ASM_SIZE(bts) " %2,%1"
    	    CC_SET(c)
    	    : CC_OUT(c) (oldbit)
    	    : ADDR, "Ir" (nr) : "memory");
    	return oldbit;
    }
    
    static __always_inline bool
    arch_test_and_clear_bit(long nr, volatile unsigned long *addr)
    {
    	return GEN_BINARY_RMWcc(LOCK_PREFIX __ASM_SIZE(btr), *addr, c, "Ir", nr);
    }
    
    /*
     * Note: the operation is performed atomically with respect to
     * the local CPU, but not other CPUs. Portable code should not
     * rely on this behaviour.
     * KVM relies on this behaviour on x86 for modifying memory that is also
     * accessed from a hypervisor on the same CPU if running in a VM: don't change
     * this without also updating arch/x86/kernel/kvm.c
     */
    static __always_inline bool
    arch___test_and_clear_bit(long nr, volatile unsigned long *addr)
    {
    	bool oldbit;
    
    	asm volatile(__ASM_SIZE(btr) " %2,%1"
    		     CC_SET(c)
    		     : CC_OUT(c) (oldbit)
    		     : ADDR, "Ir" (nr) : "memory");
    	return oldbit;
    }
    
    static __always_inline bool
    arch___test_and_change_bit(long nr, volatile unsigned long *addr)
    {
    	bool oldbit;
    
    	asm volatile(__ASM_SIZE(btc) " %2,%1"
    		     CC_SET(c)
    		     : CC_OUT(c) (oldbit)
    		     : ADDR, "Ir" (nr) : "memory");
    
    	return oldbit;
    }
    
    static __always_inline bool
    arch_test_and_change_bit(long nr, volatile unsigned long *addr)
    {
    	return GEN_BINARY_RMWcc(LOCK_PREFIX __ASM_SIZE(btc), *addr, c, "Ir", nr);
    }
    
    static __always_inline bool constant_test_bit(long nr, const volatile unsigned long *addr)
    {
    	return ((1UL << (nr & (BITS_PER_LONG-1))) &
    		(addr[nr >> _BITOPS_LONG_SHIFT])) != 0;
    }
    
    static __always_inline bool variable_test_bit(long nr, volatile const unsigned long *addr)
    {
    	bool oldbit;
    
    	asm volatile(__ASM_SIZE(bt) " %2,%1"
    		     CC_SET(c)
    		     : CC_OUT(c) (oldbit)
    		     : "m" (*(unsigned long *)addr), "Ir" (nr) : "memory");
    
    	return oldbit;
    }
    
    #define arch_test_bit(nr, addr)			\
    	(__builtin_constant_p((nr))		\
    	 ? constant_test_bit((nr), (addr))	\
    	 : variable_test_bit((nr), (addr)))
    
    /**
     * __ffs - find first set bit in word
     * @word: The word to search
     *
     * Undefined if no bit exists, so code should check against 0 first.
     */
    static __always_inline unsigned long __ffs(unsigned long word)
    {
    	asm("rep; bsf %1,%0"
    		: "=r" (word)
    		: "rm" (word));
    	return word;
    }
    
    /**
     * ffz - find first zero bit in word
     * @word: The word to search
     *
     * Undefined if no zero exists, so code should check against ~0UL first.
     */
    static __always_inline unsigned long ffz(unsigned long word)
    {
    	asm("rep; bsf %1,%0"
    		: "=r" (word)
    		: "r" (~word));
    	return word;
    }
    
    /*
     * __fls: find last set bit in word
     * @word: The word to search
     *
     * Undefined if no set bit exists, so code should check against 0 first.
     */
    static __always_inline unsigned long __fls(unsigned long word)
    {
    	asm("bsr %1,%0"
    	    : "=r" (word)
    	    : "rm" (word));
    	return word;
    }
    
    #undef ADDR
    
    #ifdef __KERNEL__
    /**
     * ffs - find first set bit in word
     * @x: the word to search
     *
     * This is defined the same way as the libc and compiler builtin ffs
     * routines, therefore differs in spirit from the other bitops.
     *
     * ffs(value) returns 0 if value is 0 or the position of the first
     * set bit if value is nonzero. The first (least significant) bit
     * is at position 1.
     */
    static __always_inline int ffs(int x)
    {
    	int r;
    
    #ifdef CONFIG_X86_64
    	/*
    	 * AMD64 says BSFL won't clobber the dest reg if x==0; Intel64 says the
    	 * dest reg is undefined if x==0, but their CPU architect says its
    	 * value is written to set it to the same as before, except that the
    	 * top 32 bits will be cleared.
    	 *
    	 * We cannot do this on 32 bits because at the very least some
    	 * 486 CPUs did not behave this way.
    	 */
    	asm("bsfl %1,%0"
    	    : "=r" (r)
    	    : "rm" (x), "0" (-1));
    #elif defined(CONFIG_X86_CMOV)
    	asm("bsfl %1,%0\n\t"
    	    "cmovzl %2,%0"
    	    : "=&r" (r) : "rm" (x), "r" (-1));
    #else
    	asm("bsfl %1,%0\n\t"
    	    "jnz 1f\n\t"
    	    "movl $-1,%0\n"
    	    "1:" : "=r" (r) : "rm" (x));
    #endif
    	return r + 1;
    }
    
    /**
     * fls - find last set bit in word
     * @x: the word to search
     *
     * This is defined in a similar way as the libc and compiler builtin
     * ffs, but returns the position of the most significant set bit.
     *
     * fls(value) returns 0 if value is 0 or the position of the last
     * set bit if value is nonzero. The last (most significant) bit is
     * at position 32.
     */
    static __always_inline int fls(unsigned int x)
    {
    	int r;
    
    #ifdef CONFIG_X86_64
    	/*
    	 * AMD64 says BSRL won't clobber the dest reg if x==0; Intel64 says the
    	 * dest reg is undefined if x==0, but their CPU architect says its
    	 * value is written to set it to the same as before, except that the
    	 * top 32 bits will be cleared.
    	 *
    	 * We cannot do this on 32 bits because at the very least some
    	 * 486 CPUs did not behave this way.
    	 */
    	asm("bsrl %1,%0"
    	    : "=r" (r)
    	    : "rm" (x), "0" (-1));
    #elif defined(CONFIG_X86_CMOV)
    	asm("bsrl %1,%0\n\t"
    	    "cmovzl %2,%0"
    	    : "=&r" (r) : "rm" (x), "rm" (-1));
    #else
    	asm("bsrl %1,%0\n\t"
    	    "jnz 1f\n\t"
    	    "movl $-1,%0\n"
    	    "1:" : "=r" (r) : "rm" (x));
    #endif
    	return r + 1;
    }
    
    /**
     * fls64 - find last set bit in a 64-bit word
     * @x: the word to search
     *
     * This is defined in a similar way as the libc and compiler builtin
     * ffsll, but returns the position of the most significant set bit.
     *
     * fls64(value) returns 0 if value is 0 or the position of the last
     * set bit if value is nonzero. The last (most significant) bit is
     * at position 64.
     */
    #ifdef CONFIG_X86_64
    static __always_inline int fls64(__u64 x)
    {
    	int bitpos = -1;
    	/*
    	 * AMD64 says BSRQ won't clobber the dest reg if x==0; Intel64 says the
    	 * dest reg is undefined if x==0, but their CPU architect says its
    	 * value is written to set it to the same as before.
    	 */
    	asm("bsrq %1,%q0"
    	    : "+r" (bitpos)
    	    : "rm" (x));
    	return bitpos + 1;
    }
    #else
    #include <asm-generic/bitops/fls64.h>
    #endif
    
    #include <asm-generic/bitops/find.h>
    
    #include <asm-generic/bitops/sched.h>
    
    #include <asm/arch_hweight.h>
    
    #include <asm-generic/bitops/const_hweight.h>
    
    #include <asm-generic/bitops/instrumented-atomic.h>
    #include <asm-generic/bitops/instrumented-non-atomic.h>
    #include <asm-generic/bitops/instrumented-lock.h>
    
    #include <asm-generic/bitops/le.h>
    
    #include <asm-generic/bitops/ext2-atomic-setbit.h>
    
    #endif /* __KERNEL__ */
    #endif /* _ASM_X86_BITOPS_H */
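
The header documents two different return conventions: `__ffs()`, `ffz()` and `__fls()` return a 0-based bit index and are undefined when no matching bit exists, while `ffs()`, `fls()` and `fls64()` return a 1-based position and 0 for a zero input. The following portable C sketch models those documented conventions with plain loops so they can be checked outside the kernel. The `model_` names are hypothetical, and the code makes no claim about how the inline assembly behaves on any particular CPU.

```c
#include <assert.h>
#include <stdint.h>

/* 1-based position of the least significant set bit; 0 when x == 0. */
static int model_ffs(int x)
{
	unsigned int u = (unsigned int)x;
	int pos = 1;

	if (u == 0)
		return 0;
	while (!(u & 1u)) {
		u >>= 1;
		pos++;
	}
	return pos;
}

/* 1-based position of the most significant set bit; 0 when x == 0. */
static int model_fls64(uint64_t x)
{
	int pos = 0;

	while (x) {
		x >>= 1;
		pos++;
	}
	return pos;
}

/* Same convention as model_fls64(), for a 32-bit argument. */
static int model_fls(unsigned int x)
{
	return model_fls64(x);
}

/* 0-based index of the first set bit; the caller must rule out 0. */
static unsigned long model___ffs(unsigned long word)
{
	unsigned long pos = 0;

	assert(word != 0);
	while (!(word & 1UL)) {
		word >>= 1;
		pos++;
	}
	return pos;
}

/* 0-based index of the first zero bit; the caller must rule out ~0UL. */
static unsigned long model_ffz(unsigned long word)
{
	assert(word != ~0UL);
	return model___ffs(~word);
}

/* 0-based index of the last set bit; the caller must rule out 0. */
static unsigned long model___fls(unsigned long word)
{
	assert(word != 0);
	return (unsigned long)model_fls64(word) - 1;
}
```

For any nonzero argument the two families differ by exactly one, which mirrors the `return r + 1;` at the end of `ffs()` and `fls()` in the header.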