Commit 63bef48f authored by Linus Torvalds's avatar Linus Torvalds

Merge branch 'akpm' (patches from Andrew)

Merge more updates from Andrew Morton:

 - a lot more of MM, quite a bit more yet to come: (memcg, pagemap,
   vmalloc, pagealloc, migration, thp, ksm, madvise, virtio,
   userfaultfd, memory-hotplug, shmem, rmap, zswap, zsmalloc, cleanups)

 - various other subsystems (procfs, misc, MAINTAINERS, bitops, lib,
   checkpatch, epoll, binfmt, kallsyms, reiserfs, kmod, gcov, kconfig,
   ubsan, fault-injection, ipc)

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (158 commits)
  ipc/shm.c: make compat_ksys_shmctl() static
  ipc/mqueue.c: fix a brace coding style issue
  lib/Kconfig.debug: fix a typo "capabilitiy" -> "capability"
  ubsan: include bug type in report header
  kasan: unset panic_on_warn before calling panic()
  ubsan: check panic_on_warn
  drivers/misc/lkdtm/bugs.c: add arithmetic overflow and array bounds checks
  ubsan: split "bounds" checker from other options
  ubsan: add trap instrumentation option
  init/Kconfig: clean up ANON_INODES and old IO schedulers options
  kernel/gcov/fs.c: replace zero-length array with flexible-array member
  gcov: gcc_3_4: replace zero-length array with flexible-array member
  gcov: gcc_4_7: replace zero-length array with flexible-array member
  kernel/kmod.c: fix a typo "assuems" -> "assumes"
  reiserfs: clean up several indentation issues
  kallsyms: unexport kallsyms_lookup_name() and kallsyms_on_each_symbol()
  samples/hw_breakpoint: drop use of kallsyms_lookup_name()
  samples/hw_breakpoint: drop HW_BREAKPOINT_R when reporting writes
  fs/binfmt_elf.c: don't free interpreter's ELF pheaders on common path
  fs/binfmt_elf.c: allocate less for static executable
  ...
parents 04de788e 1cd377ba
......@@ -2573,13 +2573,22 @@
For details see: Documentation/admin-guide/hw-vuln/mds.rst
mem=nn[KMG] [KNL,BOOT] Force usage of a specific amount of memory
Amount of memory to be used when the kernel is not able
to see the whole system memory or for test.
Amount of memory to be used in cases as follows:
1 for test;
2 when the kernel is not able to see the whole system memory;
3 memory that lies after 'mem=' boundary is excluded from
the hypervisor, then assigned to KVM guests.
[X86] Work as limiting max address. Use together
with memmap= to avoid physical address space collisions.
Without memmap= PCI devices could be placed at addresses
belonging to unused RAM.
Note that this only takes effects during boot time since
in above case 3, memory may need be hot added after boot
if system memory of hypervisor is not sufficient.
mem=nopentium [BUGS=X86-32] Disable usage of 4MB pages for kernel
memory.
......
......@@ -310,6 +310,11 @@ thp_fault_fallback
is incremented if a page fault fails to allocate
a huge page and instead falls back to using small pages.
thp_fault_fallback_charge
is incremented if a page fault fails to charge a huge page and
instead falls back to using small pages even though the
allocation was successful.
thp_collapse_alloc_failed
is incremented if khugepaged found a range
of pages that should be collapsed into one huge page but failed
......@@ -319,6 +324,15 @@ thp_file_alloc
is incremented every time a file huge page is successfully
allocated.
thp_file_fallback
is incremented if a file huge page is attempted to be allocated
but fails and instead falls back to using small pages.
thp_file_fallback_charge
is incremented if a file huge page cannot be charged and instead
falls back to using small pages even though the allocation was
successful.
thp_file_mapped
is incremented every time a file huge page is mapped into
user address space.
......
......@@ -108,6 +108,57 @@ UFFDIO_COPY. They're atomic as in guaranteeing that nothing can see an
half copied page since it'll keep userfaulting until the copy has
finished.
Notes:
- If you requested UFFDIO_REGISTER_MODE_MISSING when registering then
you must provide some kind of page in your thread after reading from
the uffd. You must provide either UFFDIO_COPY or UFFDIO_ZEROPAGE.
The normal behavior of the OS automatically providing a zero page on
an annonymous mmaping is not in place.
- None of the page-delivering ioctls default to the range that you
registered with. You must fill in all fields for the appropriate
ioctl struct including the range.
- You get the address of the access that triggered the missing page
event out of a struct uffd_msg that you read in the thread from the
uffd. You can supply as many pages as you want with UFFDIO_COPY or
UFFDIO_ZEROPAGE. Keep in mind that unless you used DONTWAKE then
the first of any of those IOCTLs wakes up the faulting thread.
- Be sure to test for all errors including (pollfd[0].revents &
POLLERR). This can happen, e.g. when ranges supplied were
incorrect.
Write Protect Notifications
---------------------------
This is equivalent to (but faster than) using mprotect and a SIGSEGV
signal handler.
Firstly you need to register a range with UFFDIO_REGISTER_MODE_WP.
Instead of using mprotect(2) you use ioctl(uffd, UFFDIO_WRITEPROTECT,
struct *uffdio_writeprotect) while mode = UFFDIO_WRITEPROTECT_MODE_WP
in the struct passed in. The range does not default to and does not
have to be identical to the range you registered with. You can write
protect as many ranges as you like (inside the registered range).
Then, in the thread reading from uffd the struct will have
msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP set. Now you send
ioctl(uffd, UFFDIO_WRITEPROTECT, struct *uffdio_writeprotect) again
while pagefault.mode does not have UFFDIO_WRITEPROTECT_MODE_WP set.
This wakes up the thread which will continue to run with writes. This
allows you to do the bookkeeping about the write in the uffd reading
thread before the ioctl.
If you registered with both UFFDIO_REGISTER_MODE_MISSING and
UFFDIO_REGISTER_MODE_WP then you need to think about the sequence in
which you supply a page and undo write protect. Note that there is a
difference between writes into a WP area and into a !WP area. The
former will have UFFD_PAGEFAULT_FLAG_WP set, the latter
UFFD_PAGEFAULT_FLAG_WRITE. The latter did not fail on protection but
you still need to supply a page when UFFDIO_REGISTER_MODE_MISSING was
used.
QEMU/KVM
========
......
.. _free_page_reporting:
=====================
Free Page Reporting
=====================
Free page reporting is an API by which a device can register to receive
lists of pages that are currently unused by the system. This is useful in
the case of virtualization where a guest is then able to use this data to
notify the hypervisor that it is no longer using certain pages in memory.
For the driver, typically a balloon driver, to use of this functionality
it will allocate and initialize a page_reporting_dev_info structure. The
field within the structure it will populate is the "report" function
pointer used to process the scatterlist. It must also guarantee that it can
handle at least PAGE_REPORTING_CAPACITY worth of scatterlist entries per
call to the function. A call to page_reporting_register will register the
page reporting interface with the reporting framework assuming no other
page reporting devices are already registered.
Once registered the page reporting API will begin reporting batches of
pages to the driver. The API will start reporting pages 2 seconds after
the interface is registered and will continue to do so 2 seconds after any
page of a sufficiently high order is freed.
Pages reported will be stored in the scatterlist passed to the reporting
function with the final entry having the end bit set in entry nent - 1.
While pages are being processed by the report function they will not be
accessible to the allocator. Once the report function has been completed
the pages will be returned to the free area from which they were obtained.
Prior to removing a driver that is making use of free page reporting it
is necessary to call page_reporting_unregister to have the
page_reporting_dev_info structure that is currently in use by free page
reporting removed. Doing this will prevent further reports from being
issued via the interface. If another driver or the same driver is
registered it is possible for it to resume where the previous driver had
left off in terms of reporting free pages.
Alexander Duyck, Dec 04, 2019
......@@ -35,9 +35,11 @@ Zswap evicts pages from compressed cache on an LRU basis to the backing swap
device when the compressed pool reaches its size limit. This requirement had
been identified in prior community discussions.
Zswap is disabled by default but can be enabled at boot time by setting
the ``enabled`` attribute to 1 at boot time. ie: ``zswap.enabled=1``. Zswap
can also be enabled and disabled at runtime using the sysfs interface.
Whether Zswap is enabled at the boot time depends on whether
the ``CONFIG_ZSWAP_DEFAULT_ON`` Kconfig option is enabled or not.
This setting can then be overridden by providing the kernel command line
``zswap.enabled=`` option, for example ``zswap.enabled=0``.
Zswap can also be enabled and disabled at runtime using the sysfs interface.
An example command to enable zswap at runtime, assuming sysfs is mounted
at ``/sys``, is::
......@@ -64,9 +66,10 @@ allocation in zpool is not directly accessible by address. Rather, a handle is
returned by the allocation routine and that handle must be mapped before being
accessed. The compressed memory pool grows on demand and shrinks as compressed
pages are freed. The pool is not preallocated. By default, a zpool
of type zbud is created, but it can be selected at boot time by
setting the ``zpool`` attribute, e.g. ``zswap.zpool=zbud``. It can
also be changed at runtime using the sysfs ``zpool`` attribute, e.g.::
of type selected in ``CONFIG_ZSWAP_ZPOOL_DEFAULT`` Kconfig option is created,
but it can be overridden at boot time by setting the ``zpool`` attribute,
e.g. ``zswap.zpool=zbud``. It can also be changed at runtime using the sysfs
``zpool`` attribute, e.g.::
echo zbud > /sys/module/zswap/parameters/zpool
......@@ -97,8 +100,9 @@ controlled policy:
* max_pool_percent - The maximum percentage of memory that the compressed
pool can occupy.
The default compressor is lzo, but it can be selected at boot time by
setting the ``compressor`` attribute, e.g. ``zswap.compressor=lzo``.
The default compressor is selected in ``CONFIG_ZSWAP_COMPRESSOR_DEFAULT``
Kconfig option, but it can be overridden at boot time by setting the
``compressor`` attribute, e.g. ``zswap.compressor=lzo``.
It can also be changed at runtime using the sysfs "compressor"
attribute, e.g.::
......
......@@ -77,21 +77,13 @@ Tips for patch submitters
8. Happy hacking.
Descriptions of section entries
-------------------------------
Descriptions of section entries and preferred order
---------------------------------------------------
M: *Mail* patches to: FullName <address@domain>
R: Designated *Reviewer*: FullName <address@domain>
These reviewers should be CCed on patches.
L: *Mailing list* that is relevant to this area
W: *Web-page* with status/info
B: URI for where to file *bugs*. A web-page with detailed bug
filing info, a direct bug tracker link, or a mailto: URI.
C: URI for *chat* protocol, server and channel where developers
usually hang out, for example irc://server/channel.
Q: *Patchwork* web based patch tracking system site
T: *SCM* tree type and location.
Type is one of: git, hg, quilt, stgit, topgit
S: *Status*, one of the following:
Supported: Someone is actually paid to look after this.
Maintained: Someone actually looks after it.
......@@ -102,30 +94,39 @@ Descriptions of section entries
Obsolete: Old code. Something tagged obsolete generally means
it has been replaced by a better system and you
should be using that.
W: *Web-page* with status/info
Q: *Patchwork* web based patch tracking system site
B: URI for where to file *bugs*. A web-page with detailed bug
filing info, a direct bug tracker link, or a mailto: URI.
C: URI for *chat* protocol, server and channel where developers
usually hang out, for example irc://server/channel.
P: Subsystem Profile document for more details submitting
patches to the given subsystem. This is either an in-tree file,
or a URI. See Documentation/maintainer/maintainer-entry-profile.rst
for details.
T: *SCM* tree type and location.
Type is one of: git, hg, quilt, stgit, topgit
F: *Files* and directories wildcard patterns.
A trailing slash includes all files and subdirectory files.
F: drivers/net/ all files in and below drivers/net
F: drivers/net/* all files in drivers/net, but not below
F: */net/* all files in "any top level directory"/net
One pattern per line. Multiple F: lines acceptable.
X: *Excluded* files and directories that are NOT maintained, same
rules as F:. Files exclusions are tested before file matches.
Can be useful for excluding a specific subdirectory, for instance:
F: net/
X: net/ipv6/
matches all files in and below net excluding net/ipv6/
N: Files and directories *Regex* patterns.
N: [^a-z]tegra all files whose path contains the word tegra
N: [^a-z]tegra all files whose path contains tegra
(not including files like integrator)
One pattern per line. Multiple N: lines acceptable.
scripts/get_maintainer.pl has different behavior for files that
match F: pattern and matches of N: patterns. By default,
get_maintainer will not look at git log history when an F: pattern
match occurs. When an N: match occurs, git log history is used
to also notify the people that have git commit signatures.
X: *Excluded* files and directories that are NOT maintained, same
rules as F:. Files exclusions are tested before file matches.
Can be useful for excluding a specific subdirectory, for instance:
F: net/
X: net/ipv6/
matches all files in and below net excluding net/ipv6/
K: *Content regex* (perl extended) pattern match in a patch or file.
For instance:
K: of_get_profile
......
......@@ -8,8 +8,6 @@
#include <asm/smp.h>
struct bootmem_data_t; /* stupid forward decl. */
/*
* Following are macros that are specific to this numa platform.
*/
......
......@@ -32,5 +32,5 @@ grep -E "^[0-9A-Fa-fXx]+[[:space:]]+${my_abis}" "$in" | sort -n | (
printf "#define __NR_syscalls\t%s\n" "${nxt}"
printf "#endif\n"
printf "\n"
printf "#endif /* %s */" "${fileguard}"
printf "#endif /* %s */\n" "${fileguard}"
) > "$out"
......@@ -141,7 +141,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write,
if (!(vma->vm_flags & VM_WRITE))
goto bad_area;
} else {
if (!(vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)))
if (unlikely(!vma_is_accessible(vma)))
goto bad_area;
}
......
......@@ -32,5 +32,5 @@ grep -E "^[0-9A-Fa-fXx]+[[:space:]]+${my_abis}" "$in" | sort -n | (
printf "#define __NR_syscalls\t%s\n" "${nxt}"
printf "#endif\n"
printf "\n"
printf "#endif /* %s */" "${fileguard}"
printf "#endif /* %s */\n" "${fileguard}"
) > "$out"
......@@ -54,6 +54,8 @@ SECTIONS {
CPUIDLE_TEXT
LOCK_TEXT
KPROBES_TEXT
IRQENTRY_TEXT
SOFTIRQENTRY_TEXT
*(.gnu.linkonce.t*)
}
......
......@@ -125,7 +125,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
case 1: /* read, present */
goto acc_err;
case 0: /* read, not present */
if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))
if (unlikely(!vma_is_accessible(vma)))
goto acc_err;
}
......
......@@ -32,5 +32,5 @@ grep -E "^[0-9A-Fa-fXx]+[[:space:]]+${my_abis}" "$in" | sort -n | (
printf "#define __NR_syscalls\t%s\n" "${nxt}"
printf "#endif\n"
printf "\n"
printf "#endif /* %s */" "${fileguard}"
printf "#endif /* %s */\n" "${fileguard}"
) > "$out"
......@@ -32,6 +32,5 @@ grep -E "^[0-9A-Fa-fXx]+[[:space:]]+${my_abis}" "$in" | sort -n | (
printf "#define __NR_syscalls\t%s\n" "${nxt}"
printf "#endif\n"
printf "\n"
printf "#endif /* %s */" "${fileguard}"
printf "\n"
printf "#endif /* %s */\n" "${fileguard}"
) > "$out"
......@@ -142,7 +142,7 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write,
goto bad_area;
}
} else {
if (!(vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)))
if (unlikely(!vma_is_accessible(vma)))
goto bad_area;
}
}
......
......@@ -47,6 +47,7 @@ SECTIONS
LOCK_TEXT
KPROBES_TEXT
IRQENTRY_TEXT
SOFTIRQENTRY_TEXT
*(.fixup)
}
......
......@@ -32,5 +32,5 @@ grep -E "^[0-9A-Fa-fXx]+[[:space:]]+${my_abis}" "$in" | sort -n | (
printf "#define __NR_syscalls\t%s\n" "${nxt}"
printf "#endif\n"
printf "\n"
printf "#endif /* %s */" "${fileguard}"
printf "#endif /* %s */\n" "${fileguard}"
) > "$out"
......@@ -32,6 +32,5 @@ grep -E "^[0-9A-Fa-fXx]+[[:space:]]+${my_abis}" "$in" | sort -n | (
printf "#define __NR_syscalls\t%s\n" "${nxt}"
printf "#endif\n"
printf "\n"
printf "#endif /* %s */" "${fileguard}"
printf "\n"
printf "#endif /* %s */\n" "${fileguard}"
) > "$out"
......@@ -422,7 +422,7 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
break;
}
} else if (vma && hva >= vma->vm_start &&
(vma->vm_flags & VM_HUGETLB)) {
is_vm_hugetlb_page(vma)) {
unsigned long psize = vma_kernel_pagesize(vma);
tsize = (gtlbe->mas1 & MAS1_TSIZE_MASK) >>
......
......@@ -314,7 +314,7 @@ static bool access_error(bool is_write, bool is_exec,
return false;
}
if (unlikely(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))))
if (unlikely(!vma_is_accessible(vma)))
return true;
/*
* We should ideally do the vma pkey access check here. But in the
......
......@@ -231,16 +231,10 @@ static int memtrace_online(void)
continue;
}
/*
* If kernel isn't compiled with the auto online option
* we need to online the memory ourselves.
*/
if (!memhp_auto_online) {
lock_device_hotplug();
walk_memory_blocks(ent->start, ent->size, NULL,
online_mem_block);
unlock_device_hotplug();
}
lock_device_hotplug();
walk_memory_blocks(ent->start, ent->size, NULL,
online_mem_block);
unlock_device_hotplug();
/*
* Memory was added successfully so clean up references to it
......
......@@ -32,5 +32,5 @@ grep -E "^[0-9A-Fa-fXx]+[[:space:]]+${my_abis}" "$in" | sort -n | (
printf "#define __NR_syscalls\t%s\n" "${nxt}"
printf "#endif\n"
printf "\n"
printf "#endif /* %s */" "${fileguard}"
printf "#endif /* %s */\n" "${fileguard}"
) > "$out"
......@@ -355,7 +355,7 @@ static inline int access_error(int error_code, struct vm_area_struct *vma)
return 1;
/* read, not present: */
if (unlikely(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))))
if (unlikely(!vma_is_accessible(vma)))
return 1;
return 0;
......
......@@ -32,5 +32,5 @@ grep -E "^[0-9A-Fa-fXx]+[[:space:]]+${my_abis}" "$in" | sort -n | (
printf "#define __NR_syscalls\t%s\n" "${nxt}"
printf "#endif\n"
printf "\n"
printf "#endif /* %s */" "${fileguard}"
printf "#endif /* %s */\n" "${fileguard}"
) > "$out"
......@@ -4,10 +4,6 @@
#define BUILD_VDSO32
#ifndef CONFIG_CC_OPTIMIZE_FOR_SIZE
#undef CONFIG_OPTIMIZE_INLINING
#endif
#ifdef CONFIG_SPARC64
/*
......
......@@ -149,6 +149,7 @@ config X86
select HAVE_ARCH_TRACEHOOK
select HAVE_ARCH_TRANSPARENT_HUGEPAGE
select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD if X86_64
select HAVE_ARCH_USERFAULTFD_WP if USERFAULTFD
select HAVE_ARCH_VMAP_STACK if X86_64
select HAVE_ARCH_WITHIN_STACK_FRAMES
select HAVE_ASM_MODVERSIONS
......
......@@ -285,7 +285,6 @@ CONFIG_EARLY_PRINTK_DBGP=y
CONFIG_DEBUG_STACKOVERFLOW=y
# CONFIG_DEBUG_RODATA_TEST is not set
CONFIG_DEBUG_BOOT_PARAMS=y
CONFIG_OPTIMIZE_INLINING=y
CONFIG_SECURITY=y
CONFIG_SECURITY_NETWORK=y
CONFIG_SECURITY_SELINUX=y
......
......@@ -282,7 +282,6 @@ CONFIG_EARLY_PRINTK_DBGP=y
CONFIG_DEBUG_STACKOVERFLOW=y
# CONFIG_DEBUG_RODATA_TEST is not set
CONFIG_DEBUG_BOOT_PARAMS=y
CONFIG_OPTIMIZE_INLINING=y
CONFIG_UNWINDER_ORC=y
CONFIG_SECURITY=y
CONFIG_SECURITY_NETWORK=y
......
// SPDX-License-Identifier: GPL-2.0
#define BUILD_VDSO32
#ifndef CONFIG_CC_OPTIMIZE_FOR_SIZE
#undef CONFIG_OPTIMIZE_INLINING
#endif
#ifdef CONFIG_X86_64
/*
......
......@@ -25,6 +25,7 @@
#include <asm/x86_init.h>
#include <asm/fpu/xstate.h>
#include <asm/fpu/api.h>
#include <asm-generic/pgtable_uffd.h>
extern pgd_t early_top_pgt[PTRS_PER_PGD];
int __init __early_make_pgtable(unsigned long address, pmdval_t pmd);
......@@ -313,6 +314,23 @@ static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear)
return native_make_pte(v & ~clear);
}
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
static inline int pte_uffd_wp(pte_t pte)
{
return pte_flags(pte) & _PAGE_UFFD_WP;
}
static inline pte_t pte_mkuffd_wp(pte_t pte)
{
return pte_set_flags(pte, _PAGE_UFFD_WP);
}
static inline pte_t pte_clear_uffd_wp(pte_t pte)
{
return pte_clear_flags(pte, _PAGE_UFFD_WP);
}
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
static inline pte_t pte_mkclean(pte_t pte)
{
return pte_clear_flags(pte, _PAGE_DIRTY);
......@@ -392,6 +410,23 @@ static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
return native_make_pmd(v & ~clear);
}
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
static inline int pmd_uffd_wp(pmd_t pmd)
{
return pmd_flags(pmd) & _PAGE_UFFD_WP;
}
static inline pmd_t pmd_mkuffd_wp(pmd_t pmd)
{
return pmd_set_flags(pmd, _PAGE_UFFD_WP);
}
static inline pmd_t pmd_clear_uffd_wp(pmd_t pmd)
{
return pmd_clear_flags(pmd, _PAGE_UFFD_WP);
}
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
static inline pmd_t pmd_mkold(pmd_t pmd)
{
return pmd_clear_flags(pmd, _PAGE_ACCESSED);
......@@ -1374,6 +1409,38 @@ static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
#endif
#endif
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
static inline pte_t pte_swp_mkuffd_wp(pte_t pte)
{
return pte_set_flags(pte, _PAGE_SWP_UFFD_WP);
}
static inline int pte_swp_uffd_wp(pte_t pte)
{
return pte_flags(pte) & _PAGE_SWP_UFFD_WP;
}
static inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
{
return pte_clear_flags(pte, _PAGE_SWP_UFFD_WP);
}
static inline pmd_t pmd_swp_mkuffd_wp(pmd_t pmd)
{
return pmd_set_flags(pmd, _PAGE_SWP_UFFD_WP);
}
static inline int pmd_swp_uffd_wp(pmd_t pmd)
{
return pmd_flags(pmd) & _PAGE_SWP_UFFD_WP;
}
static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd)
{
return pmd_clear_flags(pmd, _PAGE_SWP_UFFD_WP);
}
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
#define PKRU_AD_BIT 0x1
#define PKRU_WD_BIT 0x2
#define PKRU_BITS_PER_PKEY 2
......
......@@ -189,7 +189,7 @@ extern void sync_global_pgds(unsigned long start, unsigned long end);
*
* | ... | 11| 10| 9|8|7|6|5| 4| 3|2| 1|0| <- bit number
* | ... |SW3|SW2|SW1|G|L|D|A|CD|WT|U| W|P| <- bit names
* | TYPE (59-63) | ~OFFSET (9-58) |0|0|X|X| X| X|X|SD|0| <- swp entry
* | TYPE (59-63) | ~OFFSET (9-58) |0|0|X|X| X| X|F|SD|0| <- swp entry
*
* G (8) is aliased and used as a PROT_NONE indicator for
* !present ptes. We need to start storing swap entries above
......@@ -197,9 +197,15 @@ extern void sync_global_pgds(unsigned long start, unsigned long end);
* erratum where they can be incorrectly set by hardware on
* non-present PTEs.
*
* SD Bits 1-4 are not used in non-present format and available for
* special use described below:
*
* SD (1) in swp entry is used to store soft dirty bit, which helps us
* remember soft dirty over page migration
*
* F (2) in swp entry is used to record when a pagetable is
* writeprotected by userfaultfd WP support.
*
* Bit 7 in swp entry should be 0 because pmd_present checks not only P,
* but also L and G.
*
......
......@@ -32,6 +32,7 @@
#define _PAGE_BIT_SPECIAL _PAGE_BIT_SOFTW1
#define _PAGE_BIT_CPA_TEST _PAGE_BIT_SOFTW1
#define _PAGE_BIT_UFFD_WP _PAGE_BIT_SOFTW2 /* userfaultfd wrprotected */
#define _PAGE_BIT_SOFT_DIRTY _PAGE_BIT_SOFTW3 /* software dirty tracking */
#define _PAGE_BIT_DEVMAP _PAGE_BIT_SOFTW4
......@@ -100,6 +101,14 @@
#define _PAGE_SWP_SOFT_DIRTY (_AT(pteval_t, 0))
#endif
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
#define _PAGE_UFFD_WP (_AT(pteval_t, 1) << _PAGE_BIT_UFFD_WP)
#define _PAGE_SWP_UFFD_WP _PAGE_USER
#else
#define _PAGE_UFFD_WP (_AT(pteval_t, 0))
#define _PAGE_SWP_UFFD_WP (_AT(pteval_t, 0))
#endif
#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
#define _PAGE_NX (_AT(pteval_t, 1) << _PAGE_BIT_NX)
#define _PAGE_DEVMAP (_AT(u64, 1) << _PAGE_BIT_DEVMAP)
......@@ -118,7 +127,8 @@
*/
#define _PAGE_CHG_MASK (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT | \
_PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY | \
_PAGE_SOFT_DIRTY | _PAGE_DEVMAP | _PAGE_ENC)
_PAGE_SOFT_DIRTY | _PAGE_DEVMAP | _PAGE_ENC | \
_PAGE_UFFD_WP)
#define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE)