Commit e900a918 authored by Dan Williams's avatar Dan Williams Committed by Linus Torvalds
Browse files

mm: shuffle initial free memory to improve memory-side-cache utilization

Patch series "mm: Randomize free memory", v10.

This patch (of 3):

Randomization of the page allocator improves the average utilization of
a direct-mapped memory-side-cache.  Memory side caching is a platform
capability that Linux has been previously exposed to in HPC
(high-performance computing) environments on specialty platforms.  In
that instance it was a smaller pool of high-bandwidth-memory relative to
higher-capacity / lower-bandwidth DRAM.  Now, this capability is going
to be found on general purpose server platforms where DRAM is a cache in
front of higher latency persistent memory [1].

Robert offered an explanation of the state of the art of Linux
interactions with memory-side-caches [2], and I copy it here:

    It's been a problem in the HPC space:

    A kernel module called zonesort is available to try to help:

    and this abandoned patch series proposed that for the kernel:

    Dan's patch series doesn't attempt to ensure buffers won't conflict, but
    also reduces the chance that the buffers will. This will make performance
    more consistent, albeit slower than "optimal" (which is near impossible
    to attain in a general-purpose kernel).  That's better than forcing
    users to deploy remedies like:
        "To eliminate this gradual degradation, we have added a Stream
         measurement to the Node Health Check that follows each job;
         nodes are rebooted whenever their measured memory bandwidth
         falls below 300 GB/s."

A replacement for zonesort was merged upstream in commit cc9aec03
("x86/numa_emulation: Introduce uniform split capability").  With this
numa_emulation capability, memory can be split into cache sized
("near-memory" sized) numa nodes.  A bind operation to such a node, and
disabling workloads on other nodes, enables full cache performance.
However, once the workload exceeds the cache size then cache conflicts
are unavoidable.  While HPC environments might be able to tolerate
time-scheduling of cache sized workloads, for general purpose server
platforms, the oversubscribed cache case will be the common case.

The worst case scenario is that a server system owner benchmarks a
workload at boot with an un-contended cache only to see that performance
degrade over time, even below the average cache performance due to
excessive conflicts.  Randomization clips the peaks and fills in the
valleys of cache utilization to yield steady average performance.

Here are some performance impact details of the patches:

1/ An Intel internal synthetic memory bandwidth measurement tool, saw a
   3X speedup in a contrived case that tries to force cache conflicts.
   The contrived cased used the numa_emulation capability to force an
   instance of the benchmark to be run in two of the near-memory sized
   numa nodes.  If both instances were placed on the same emulated they
   would fit and cause zero conflicts.  While on separate emulated nodes
   without randomization they underutilized the cache and conflicted
   unnecessarily due to the in-order allocation per node.

2/ A well known Java server application benchmark was run with a heap
   size that exceeded cache size by 3X.  The cache conflict rate was 8%
   for the first run and degraded to 21% after page allocator aging.  With
   randomization enabled the rate levelled out at 11%.

3/ A MongoDB workload did not observe measurable difference in
   cache-conflict rates, but the overall throughput dropped by 7% with
   randomization in one case.

4/ Mel Gorman ran his suite of performance workloads with randomization
   enabled on platforms without a memory-side-cache and saw a mix of some
   improvements and some losses [3].

While there is potentially significant improvement for applications that
depend on low latency access across a wide working-set, the performance
may be negligible to negative for other workloads.  For this reason the
shuffle capability defaults to off unless a direct-mapped
memory-side-cache is detected.  Even then, the page_alloc.shuffle=0
parameter can be specified to disable the randomization on those systems.

Outside of memory-side-cache utilization concerns there is potentially
security benefit from randomization.  Some data exfiltration and
return-oriented-programming attacks rely on the ability to infer the
location of sensitive data objects.  The kernel page allocator, especially
early in system boot, has predictable first-in-first out behavior for
physical pages.  Pages are freed in physical address order when first

Quoting Kees:
    "While we already have a base-address randomization
     (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
     memory layouts would certainly be using the predictability of
     allocation ordering (i.e. for attacks where the base address isn't
     important: only the relative positions between allocated memory).
     This is common in lots of heap-style attacks. They try to gain
     control over ordering by spraying allocations, etc.

     I'd really like to see this because it gives us something similar
     to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."

While SLAB_FREELIST_RANDOM reduces the predictability of some local slab
caches it leaves vast bulk of memory to be predictably in order allocated.
However, it should be noted, the concrete security benefits are hard to
quantify, and no known CVE is mitigated by this randomization.

Introduce shuffle_free_memory(), and its helper shuffle_zone(), to perform
a Fisher-Yates shuffle of the page allocator 'free_area' lists when they
are initially populated with free memory at boot and at hotplug time.  Do
this based on either the presence of a page_alloc.shuffle=Y command line
parameter, or autodetection of a memory-side-cache (to be added in a
follow-on patch).

The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free
pages where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1 i.e.  10,
4MB this trades off randomization granularity for time spent shuffling.
MAX_ORDER-1 was chosen to be minimally invasive to the page allocator
while still showing memory-side cache behavior improvements, and the
expectation that the security implications of finer granularity
randomization is mitigated by CONFIG_SLAB_FREELIST_RANDOM.  The
performance impact of the shuffling appears to be in the noise compared to
other memory initialization work.

This initial randomization can be undone over time so a follow-on patch is
introduced to inject entropy on page free decisions.  It is reasonable to
ask if the page free entropy is sufficient, but it is not enough due to
the in-order initial freeing of pages.  At the start of that process
putting page1 in front or behind page0 still keeps them close together,
page2 is still near page1 and has a high chance of being adjacent.  As
more pages are added ordering diversity improves, but there is still high
page locality for the low address pages and this leads to no significant
impact to the cache conflict rate.


[ fix shuffle enable]
[ fix SHUFFLE_PAGE_ALLOCATOR help texts]

Signed-off-by: default avatarDan Williams <>
Signed-off-by: default avatarQian Cai <>
Reviewed-by: default avatarKees Cook <>
Acked-by: default avatarMichal Hocko <>
Cc: Dave Hansen <>
Cc: Keith Busch <>
Cc: Robert Elliott <>
Signed-off-by: default avatarAndrew Morton <>
Signed-off-by: default avatarLinus Torvalds <>
parent 4d36e6f8
......@@ -3174,6 +3174,16 @@
This will also cause panics on machine check exceptions.
Useful together with panic=30 to trigger a reboot.
[KNL] Boolean flag to control whether the page allocator
should randomize its free lists. The randomization may
be automatically enabled if the kernel detects it is
running on a platform with a direct-mapped memory-side
cache, and this parameter can be used to
override/disable that behavior. The state of the flag
can be read from sysfs at:
page_owner= [KNL] Boot-time page_owner enabling option.
Storage of the information about who allocated
each page is disabled in default. With this switch,
......@@ -150,6 +150,23 @@ static inline void list_replace_init(struct list_head *old,
* list_swap - replace entry1 with entry2 and re-add entry1 at entry2's position
* @entry1: the location to place entry2
* @entry2: the location to place entry1
static inline void list_swap(struct list_head *entry1,
struct list_head *entry2)
struct list_head *pos = entry2->prev;
list_replace(entry1, entry2);
if (pos == entry1)
pos = entry2;
list_add(entry1, pos);
* list_del_init - deletes entry from list and reinitialize it.
* @entry: the element to delete from the list.
......@@ -1271,6 +1271,7 @@ void sparse_init(void);
#define sparse_init() do {} while (0)
#define sparse_index_init(_sec, _nid) do {} while (0)
#define pfn_present pfn_valid
......@@ -1752,6 +1752,30 @@ config SLAB_FREELIST_HARDENED
sacrifies to harden the kernel slab allocator against common
freelist exploit methods.
bool "Page allocator randomization"
Randomization of the page allocator improves the average
utilization of a direct-mapped memory-side-cache. See section
5.2.27 Heterogeneous Memory Attribute Table (HMAT) in the ACPI
6.2a specification for an example of how a platform advertises
the presence of a memory-side-cache. There are also incidental
security benefits as it reduces the predictability of page
allocations to compliment SLAB_FREELIST_RANDOM, but the
default granularity of shuffling on the "MAX_ORDER - 1" i.e,
10th order of pages is selected based on cache utilization
benefits on x86.
While the randomization improves cache utilization it may
negatively impact workloads on platforms without a cache. For
this reason, by default, the randomization is enabled only
after runtime detection of a direct-mapped memory-side-cache.
Otherwise, the randomization may be force enabled with the
'page_alloc.shuffle' kernel command line parameter.
Say Y if unsure.
default y
depends on SLUB && SMP
......@@ -33,7 +33,7 @@ mmu-$(CONFIG_MMU) += process_vm_access.o
obj-y := filemap.o mempool.o oom_kill.o fadvise.o \
maccess.o page_alloc.o page-writeback.o \
maccess.o page-writeback.o \
readahead.o swap.o truncate.o vmscan.o shmem.o \
util.o mmzone.o vmstat.o backing-dev.o \
mm_init.o mmu_context.o percpu.o slab_common.o \
......@@ -41,6 +41,11 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \
interval_tree.o list_lru.o workingset.o \
debug.o $(mmu-y)
# Give 'page_alloc' its own module-parameter namespace
page-alloc-y := page_alloc.o
page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) += shuffle.o
obj-y += page-alloc.o
obj-y += init-mm.o
obj-y += memblock.o
......@@ -39,6 +39,7 @@
#include <asm/tlbflush.h>
#include "internal.h"
#include "shuffle.h"
* online_page_callback contains pointer to current page onlining function.
......@@ -891,6 +892,8 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
zone->zone_pgdat->node_present_pages += onlined_pages;
pgdat_resize_unlock(zone->zone_pgdat, &flags);
if (onlined_pages) {
node_states_set_node(nid, &arg);
if (need_zonelists_rebuild)
......@@ -72,6 +72,7 @@
#include <asm/tlbflush.h>
#include <asm/div64.h>
#include "internal.h"
#include "shuffle.h"
/* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
static DEFINE_MUTEX(pcp_batch_high_lock);
......@@ -1874,9 +1875,9 @@ _deferred_grow_zone(struct zone *zone, unsigned int order)
void __init page_alloc_init_late(void)
struct zone *zone;
int nid;
int nid;
/* There will be num_node_state(N_MEMORY) threads */
atomic_set(&pgdat_init_n_undone, num_node_state(N_MEMORY));
......@@ -1900,6 +1901,9 @@ void __init page_alloc_init_late(void)
/* Discard memblock private memory */
for_each_node_state(nid, N_MEMORY)
// SPDX-License-Identifier: GPL-2.0
// Copyright(c) 2018 Intel Corporation. All rights reserved.
#include <linux/mm.h>
#include <linux/init.h>
#include <linux/mmzone.h>
#include <linux/random.h>
#include <linux/moduleparam.h>
#include "internal.h"
#include "shuffle.h"
static unsigned long shuffle_state __ro_after_init;
* Depending on the architecture, module parameter parsing may run
* before, or after the cache detection. SHUFFLE_FORCE_DISABLE prevents,
* or reverts the enabling of the shuffle implementation. SHUFFLE_ENABLE
* attempts to turn on the implementation, but aborts if it finds
__meminit void page_alloc_shuffle(enum mm_shuffle_ctl ctl)
set_bit(SHUFFLE_FORCE_DISABLE, &shuffle_state);
if (test_bit(SHUFFLE_FORCE_DISABLE, &shuffle_state)) {
if (test_and_clear_bit(SHUFFLE_ENABLE, &shuffle_state))
} else if (ctl == SHUFFLE_ENABLE
&& !test_and_set_bit(SHUFFLE_ENABLE, &shuffle_state))
static bool shuffle_param;
extern int shuffle_show(char *buffer, const struct kernel_param *kp)
return sprintf(buffer, "%c\n", test_bit(SHUFFLE_ENABLE, &shuffle_state)
? 'Y' : 'N');
static __meminit int shuffle_store(const char *val,
const struct kernel_param *kp)
int rc = param_set_bool(val, kp);
if (rc < 0)
return rc;
if (shuffle_param)
return 0;
module_param_call(shuffle, shuffle_store, shuffle_show, &shuffle_param, 0400);
* For two pages to be swapped in the shuffle, they must be free (on a
* 'free_area' lru), have the same order, and have the same migratetype.
static struct page * __meminit shuffle_valid_page(unsigned long pfn, int order)
struct page *page;
* Given we're dealing with randomly selected pfns in a zone we
* need to ask questions like...
/* the pfn even in the memmap? */
if (!pfn_valid_within(pfn))
return NULL;
/* the pfn in a present section or a hole? */
if (!pfn_present(pfn))
return NULL;
/* the page free and currently on a free_area list? */
page = pfn_to_page(pfn);
if (!PageBuddy(page))
return NULL;
* the page on the same list as the page we will
* shuffle it with?
if (page_order(page) != order)
return NULL;
return page;
* Fisher-Yates shuffle the freelist which prescribes iterating through an
* array, pfns in this case, and randomly swapping each entry with another in
* the span, end_pfn - start_pfn.
* To keep the implementation simple it does not attempt to correct for sources
* of bias in the distribution, like modulo bias or pseudo-random number
* generator bias. I.e. the expectation is that this shuffling raises the bar
* for attacks that exploit the predictability of page allocations, but need not
* be a perfect shuffle.
#define SHUFFLE_RETRY 10
void __meminit __shuffle_zone(struct zone *z)
unsigned long i, flags;
unsigned long start_pfn = z->zone_start_pfn;
unsigned long end_pfn = zone_end_pfn(z);
const int order = SHUFFLE_ORDER;
const int order_pages = 1 << order;
spin_lock_irqsave(&z->lock, flags);
start_pfn = ALIGN(start_pfn, order_pages);
for (i = start_pfn; i < end_pfn; i += order_pages) {
unsigned long j;
int migratetype, retry;
struct page *page_i, *page_j;
* We expect page_i, in the sub-range of a zone being added
* (@start_pfn to @end_pfn), to more likely be valid compared to
* page_j randomly selected in the span @zone_start_pfn to
* @spanned_pages.
page_i = shuffle_valid_page(i, order);
if (!page_i)
for (retry = 0; retry < SHUFFLE_RETRY; retry++) {
* Pick a random order aligned page in the zone span as
* a swap target. If the selected pfn is a hole, retry
* up to SHUFFLE_RETRY attempts find a random valid pfn
* in the zone.
j = z->zone_start_pfn +
ALIGN_DOWN(get_random_long() % z->spanned_pages,
page_j = shuffle_valid_page(j, order);
if (page_j && page_j != page_i)
if (retry >= SHUFFLE_RETRY) {
pr_debug("%s: failed to swap %#lx\n", __func__, i);
* Each migratetype corresponds to its own list, make sure the
* types match otherwise we're moving pages to lists where they
* do not belong.
migratetype = get_pageblock_migratetype(page_i);
if (get_pageblock_migratetype(page_j) != migratetype) {
pr_debug("%s: migratetype mismatch %#lx\n", __func__, i);
list_swap(&page_i->lru, &page_j->lru);
pr_debug("%s: swap: %#lx -> %#lx\n", __func__, i, j);
/* take it easy on the zone lock */
if ((i % (100 * order_pages)) == 0) {
spin_unlock_irqrestore(&z->lock, flags);
spin_lock_irqsave(&z->lock, flags);
spin_unlock_irqrestore(&z->lock, flags);
* shuffle_free_memory - reduce the predictability of the page allocator
* @pgdat: node page data
void __meminit __shuffle_free_memory(pg_data_t *pgdat)
struct zone *z;
for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
// SPDX-License-Identifier: GPL-2.0
// Copyright(c) 2018 Intel Corporation. All rights reserved.
#ifndef _MM_SHUFFLE_H
#define _MM_SHUFFLE_H
#include <linux/jump_label.h>
* SHUFFLE_ENABLE is called from the command line enabling path, or by
* platform-firmware enabling that indicates the presence of a
* direct-mapped memory-side-cache. SHUFFLE_FORCE_DISABLE is called from
* the command line path and overrides any previous or future
enum mm_shuffle_ctl {
extern void page_alloc_shuffle(enum mm_shuffle_ctl ctl);
extern void __shuffle_free_memory(pg_data_t *pgdat);
static inline void shuffle_free_memory(pg_data_t *pgdat)
if (!static_branch_unlikely(&page_alloc_shuffle_key))
extern void __shuffle_zone(struct zone *z);
static inline void shuffle_zone(struct zone *z)
if (!static_branch_unlikely(&page_alloc_shuffle_key))
static inline void shuffle_free_memory(pg_data_t *pgdat)
static inline void shuffle_zone(struct zone *z)
static inline void page_alloc_shuffle(enum mm_shuffle_ctl ctl)
#endif /* _MM_SHUFFLE_H */
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment