1. 02 Apr, 2020 40 commits
    • Stephen Rothwell's avatar
      17f166b5
    • Merge branch 'akpm/master' · 83a73758
      Stephen Rothwell authored
    • drivers/media/platform/sti/delta/delta-ipc.c: fix read buffer overflow · cd36aa87
      Andi Kleen authored
      The single caller passes a string to delta_ipc_open(), which copies it
      with a fixed size larger than the string.  So it also copies whatever
      data follows the original string in the read-only segment.
      
      If the string was at the end of a page it may fault.
      
      Just copy the string with a normal strcpy after clearing the field.
      
      Found by an LTO build, which errors out because the compiler inlines
      the functions, can resolve the string sizes, and triggers the
      compile-time checks in memcpy.
      
      In function `memcpy',
          inlined from `delta_ipc_open.constprop' at linux/drivers/media/platform/sti/delta/delta-ipc.c:178:0,
          inlined from `delta_mjpeg_ipc_open' at linux/drivers/media/platform/sti/delta/delta-mjpeg-dec.c:227:0,
          inlined from `delta_mjpeg_decode' at linux/drivers/media/platform/sti/delta/delta-mjpeg-dec.c:403:0:
      /home/andi/lsrc/linux/include/linux/string.h:337:0: error: call to `__read_overflow2' declared with attribute error: detected read beyond size of object passed as 2nd parameter
          __read_overflow2();
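
The fix described above (clear the field, then copy only the string) can be sketched in plain userspace C. This is a minimal illustration, not the driver's code: `struct open_msg`, `NAME_FIELD_SIZE`, and `set_name()` are hypothetical names, and the explicit length check is an added assumption so the sketch stays safe on its own.

```c
#include <string.h>

#define NAME_FIELD_SIZE 32	/* hypothetical size of the destination field */

struct open_msg {		/* hypothetical message layout */
	char name[NAME_FIELD_SIZE];
};

/*
 * Copy a NUL-terminated name into the fixed-size field.
 * A memcpy(msg->name, name, sizeof(msg->name)) would read past the end of
 * a shorter source string; clearing the field and copying only the string
 * avoids that over-read.
 * Returns 0 on success, -1 if the name does not fit (assumed policy).
 */
static int set_name(struct open_msg *msg, const char *name)
{
	if (strlen(name) >= sizeof(msg->name))
		return -1;
	memset(msg->name, 0, sizeof(msg->name));
	strcpy(msg->name, name);
	return 0;
}
```

The source string is read only up to its terminator, so nothing beyond the string literal is touched, while the rest of the field is still zero-filled.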
      
      Link: http://lkml.kernel.org/r/20171222001212.1850-1-andi@firstfloor.org
      
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Cc: Hugues FRUCHET <hugues.fruchet@st.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
    • ipc/util.c: sysvipc_find_ipc() should increase position index · 9eaa845d
      Vasily Averin authored
      If a seq_file .next function does not change the position index, a read
      after an lseek can generate unexpected output.
      
      https://bugzilla.kernel.org/show_bug.cgi?id=206283
      Link: http://lkml.kernel.org/r/b7a20945-e315-8bb0-21e6-3875c14a8494@virtuozzo.com
      
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Acked-by: Waiman Long <longman@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Peter Oberparleiter <oberpar@linux.ibm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
    • kernel/gcov/fs.c: gcov_seq_next() should increase position index · 758759df
      Vasily Averin authored
      If a seq_file .next function does not change the position index, a read
      after an lseek can generate unexpected output.
      
      https://bugzilla.kernel.org/show_bug.cgi?id=206283
      Link: http://lkml.kernel.org/r/f65c6ee7-bd00-f910-2f8a-37cc67e4ff88@virtuozzo.com
      
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Acked-by: Peter Oberparleiter <oberpar@linux.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
    • seq_read-info-message-about-buggy-next-functions-fix · aa3340b9
      Andrew Morton authored
      
      
      s/pr_info/pr_info_ratelimited/, per Qian Cai
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Peter Oberparleiter <oberpar@linux.ibm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Vasily Averin <vvs@virtuozzo.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
    • fs/seq_file.c: seq_read(): add info message about buggy .next functions · 59c1cf68
      Vasily Averin authored
      Patch series "seq_file .next functions should increase position index".
      
      In Aug 2018 NeilBrown noted in commit 1f4aace6 ("fs/seq_file.c:
      simplify seq_file iteration code and interface"):
      
      "Some ->next functions do not increment *pos when they return NULL...
      Note that such ->next functions are buggy and should be fixed.  A simple
      demonstration is dd if=/proc/swaps bs=1000 skip=1 Choose any block size
      larger than the size of /proc/swaps.  This will always show the whole last
      line of /proc/swaps"
      
      The described problem still exists.  If you lseek into the middle of the
      last output line, the following read outputs the end of that line and
      then the whole last line once again.
      
      $ dd if=/proc/swaps bs=1  # usual output
      Filename				Type		Size	Used	Priority
      /dev/dm-0                               partition	4194812	97536	-2
      104+0 records in
      104+0 records out
      104 bytes copied
      
      $ dd if=/proc/swaps bs=40 skip=1    # last line was generated twice
      dd: /proc/swaps: cannot skip to specified offset
      v/dm-0                               partition	4194812	97536	-2
      /dev/dm-0                               partition	4194812	97536	-2
      3+1 records in
      3+1 records out
      131 bytes copied
      
      There are a lot of other affected files; I've found 30+, including
      /proc/net/ip_tables_matches and /proc/sysvipc/*.

      I've already sent patches to the mailing lists of the affected
      subsystems.  This patch set fixes the problem in files related to
      pstore, tracing, gcov, sysvipc and other subsystems that are handled
      directly via the linux-kernel@ mailing list.
      
      https://bugzilla.kernel.org/show_bug.cgi?id=206283
      
      This patch (of 4):
      
      Add debug code to seq_read() to detect missed or out-of-tree incorrect
      .next seq_file functions.
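
The difference between a buggy and a correct .next function can be sketched in userspace C. This is a simplified model, not kernel code: the real callback also takes a `struct seq_file *` and the current element, and `items`, `NUM_ITEMS`, `next_buggy()`, and `next_fixed()` are hypothetical names used only for illustration.

```c
#include <stddef.h>

#define NUM_ITEMS 3	/* hypothetical number of records */

static int items[NUM_ITEMS] = { 10, 20, 30 };

/* Buggy variant: leaves *pos untouched when it runs off the end. */
static int *next_buggy(long long *pos)
{
	if (*pos + 1 >= NUM_ITEMS)
		return NULL;
	++*pos;
	return &items[*pos];
}

/* Fixed variant: always advances *pos, even when returning NULL. */
static int *next_fixed(long long *pos)
{
	++*pos;
	if (*pos >= NUM_ITEMS)
		return NULL;
	return &items[*pos];
}
```

With the buggy variant the position still points at the last record after the end was reached, which is how a later read can emit that record a second time.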
      
      https://bugzilla.kernel.org/show_bug.cgi?id=206283
      Link: http://lkml.kernel.org/r/244674e5-760c-86bd-d08a-047042881748@virtuozzo.com
      Link: http://lkml.kernel.org/r/7c24087c-e280-e580-5b0c-0cdaeb14cd18@virtuozzo.com
      
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Peter Oberparleiter <oberpar@linux.ibm.com>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
    • arm64: memory: give hotplug memory a different resource name · ba7e4aaf
      James Morse authored
      If kexec chooses to place the kernel in a memory region that was added
      after boot, we fail to boot as the kernel is running from a location that
      is not described as memory by the UEFI memory map or the original DT.
      
      To prevent unaware user-space kexec from doing this accidentally, give
      these regions a different name.
      
      Link: http://lkml.kernel.org/r/20200326180730.4754-4-james.morse@arm.com
      
      Signed-off-by: James Morse <james.morse@arm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Bhupesh Sharma <bhsharma@redhat.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
    • mm/memory_hotplug: allow arch override of non boot memory resource names · b2f535f4
      James Morse authored
      Memory added to the system by hotplug has a 'System RAM' resource created
      for it.  This is exposed to user-space via /proc/iomem.
      
      This poses problems for kexec on arm64.  If kexec decides to place the
      kernel in one of these newly onlined regions, the new kernel will find
      itself booting from a region not described as memory in the firmware
      tables.
      
      Arm64 doesn't have a structure like the e820 memory map that can be
      re-written when memory is brought online.  Instead arm64 uses the UEFI
      memory map, or the memory node from the DT, sometimes both.  We never
      rewrite these.
      
      Allow an architecture to specify a different name for these hotplug
      regions.
      
      Link: http://lkml.kernel.org/r/20200326180730.4754-3-james.morse@arm.com
      
      Signed-off-by: James Morse <james.morse@arm.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Bhupesh Sharma <bhsharma@redhat.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
    • kexec: prevent removal of memory in use by a loaded kexec image · c579c851
      James Morse authored
      Patch series "kexec/memory_hotplug: Prevent removal and accidental use".
      
      arm64 recently queued support for memory hotremove, which led to some new
      corner cases for kexec.
      
      If the kexec segments are loaded for a removable region, that region may
      be removed before kexec actually occurs.  This causes the first kernel to
      lock up when applying the relocations.  (I've triggered this on x86 too.)
      
      The first patch adds a memory notifier for kexec so that it can refuse to
      allow in-use regions to be taken offline.
      
      This doesn't solve the problem for arm64, where the new kernel must
      initially rely on the data structures from the first boot to describe
      memory.  These don't describe hotpluggable memory.  If kexec places the
      kernel in one of these regions, it must also provide a DT that describes
      the region in which the kernel was mapped as memory.  (and somehow ensure
      it's always present in the future...)
      
      To prevent this from happening accidentally with unaware user-space,
      patches two and three allow arm64 to give these regions a different name.
      
      This is a change in behaviour for arm64 as memory hotadd and hotremove
      were added separately.
      
      I haven't tried kdump.  Unaware kdump from user-space probably won't
      describe the hotplug regions if the name is different, which saves us from
      problems if the memory is no longer present at kdump time, but means the
      vmcore is incomplete.
      
      This patch (of 3):
      
      An image loaded for kexec is not stored in place, instead its segments are
      scattered through memory, and are re-assembled when needed.  In the
      meantime, the target memory may have been removed.
      
      Because mm is not aware that this memory is still in use, it allows it
      to be removed.
      
      Add a memory notifier to prevent the removal of memory regions that
      overlap with a loaded kexec image segment.  e.g., when triggered from the
      Qemu console:
      
      | kexec_core: memory region in use
      | memory memory32: Offline failed.
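
The core of such a check is a range-overlap test between the region being taken offline and each loaded segment. A minimal userspace sketch follows. `struct segment`, `ranges_overlap()`, and `region_in_use()` are hypothetical names, and the actual notifier plumbing is omitted.

```c
#include <stdbool.h>
#include <stddef.h>

struct segment {		/* hypothetical stand-in for a loaded image segment */
	unsigned long mem;	/* start address of the destination range */
	unsigned long memsz;	/* size of the destination range in bytes */
};

/* Half-open ranges [a_start, a_end) and [b_start, b_end) overlap? */
static bool ranges_overlap(unsigned long a_start, unsigned long a_end,
			   unsigned long b_start, unsigned long b_end)
{
	return a_start < b_end && b_start < a_end;
}

/* True if the region [start, start + size) overlaps any segment. */
static bool region_in_use(const struct segment *segs, size_t nr,
			  unsigned long start, unsigned long size)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		if (ranges_overlap(start, start + size,
				   segs[i].mem, segs[i].mem + segs[i].memsz))
			return true;
	}
	return false;
}
```

A notifier built on this idea would refuse the offline request whenever `region_in_use()` returns true for the range being removed.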
      
      Link: http://lkml.kernel.org/r/20200326180730.4754-1-james.morse@arm.com
      Link: http://lkml.kernel.org/r/20200326180730.4754-2-james.morse@arm.com
      
      Signed-off-by: James Morse <james.morse@arm.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Bhupesh Sharma <bhsharma@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
    • selftests: kmod: test disabling module autoloading · 9ac5b46a
      Eric Biggers authored
      Test that request_module() fails with -ENOENT when
      /proc/sys/kernel/modprobe contains (a) a nonexistent path, and (b) an
      empty path.
      
      Case (b) is a regression test for the patch "kmod: make request_module()
      return an error when autoloading is disabled".
      
      Tested with 'kmod.sh -t 0010 && kmod.sh -t 0011', and also simply with
      'kmod.sh' to run all kmod tests.
      
      Link: http://lkml.kernel.org/r/20200312202552.241885-5-ebiggers@kernel.org
      
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Acked-by: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jeff Vander Stoep <jeffv@google.com>
      Cc: Jessica Yu <jeyu@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: NeilBrown <neilb@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
    • selftests: kmod: fix handling test numbers above 9 · 45b9cf50
      Eric Biggers authored
      get_test_count() and get_test_enabled() were broken for test numbers above
      9 due to awk interpreting a field specification like '$0010' as octal
      rather than decimal.  Fix it by stripping the leading zeroes.
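
The same class of pitfall exists outside awk. As an analogy only (the selftest itself is a shell script, and the helper names here are hypothetical), C's `strtol()` with base 0 also treats a leading zero as an octal prefix, and stripping the leading zeroes restores the decimal reading:

```c
#include <stdlib.h>

/* Skip leading zeroes but keep a lone "0" intact. */
static const char *strip_leading_zeroes(const char *s)
{
	while (s[0] == '0' && s[1] != '\0')
		s++;
	return s;
}

/* Parse with automatic base detection: a leading "0" selects octal. */
static long parse_auto_base(const char *s)
{
	return strtol(s, NULL, 0);
}
```

Here `parse_auto_base("0010")` yields 8, while `parse_auto_base(strip_leading_zeroes("0010"))` yields 10.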
      
      Link: http://lkml.kernel.org/r/20200318230515.171692-5-ebiggers@kernel.org
      
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Acked-by: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jeff Vander Stoep <jeffv@google.com>
      Cc: Jessica Yu <jeyu@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: NeilBrown <neilb@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
    • docs-admin-guide-document-the-kernelmodprobe-sysctl-v5 · f53a2ce3
      Eric Biggers authored
      Link: http://lkml.kernel.org/r/20200318230515.171692-4-ebiggers@kernel.org
      
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jeff Vander Stoep <jeffv@google.com>
      Cc: Jessica Yu <jeyu@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: NeilBrown <neilb@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
    • docs: admin-guide: document the kernel.modprobe sysctl · 6a697e75
      Eric Biggers authored
      Document the kernel.modprobe sysctl in the same place that all the other
      kernel.* sysctls are documented.  Make sure to mention how to use this
      sysctl to completely disable module autoloading, and how this sysctl
      relates to CONFIG_STATIC_USERMODEHELPER.
      
      Link: http://lkml.kernel.org/r/20200312202552.241885-4-ebiggers@kernel.org
      
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jeff Vander Stoep <jeffv@google.com>
      Cc: Jessica Yu <jeyu@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: NeilBrown <neilb@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
    • fs/filesystems.c: downgrade user-reachable WARN_ONCE() to pr_warn_once() · ad184f6c
      Eric Biggers authored
      After request_module(), nothing is stopping the module from being unloaded
      until someone takes a reference to it via try_module_get().
      
      The WARN_ONCE() in get_fs_type() is thus user-reachable, via userspace
      running 'rmmod' concurrently.
      
      Since WARN_ONCE() is for kernel bugs only, not for user-reachable
      situations, downgrade this warning to pr_warn_once().
      
      Keep it printed once only, since the intent of this warning is to detect a
      bug in modprobe at boot time.  Printing the warning more than once
      wouldn't really provide any useful extra information.
      
      Link: http://lkml.kernel.org/r/20200312202552.241885-3-ebiggers@kernel.org
      Fixes: 41124db8 ("fs: warn in case userspace lied about modprobe return")
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Reviewed-by: Jessica Yu <jeyu@kernel.org>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jeff Vander Stoep <jeffv@google.com>
      Cc: Jessica Yu <jeyu@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: NeilBrown <neilb@suse.com>
      Cc: <stable@vger.kernel.org>		[4.13+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
    • kmod: make request_module() return an error when autoloading is disabled · 001496e3
      Eric Biggers authored
      Patch series "module autoloading fixes and cleanups", v5.
      
      This series fixes a bug where request_module() was reporting success to
      kernel code when module autoloading had been completely disabled via 'echo
      > /proc/sys/kernel/modprobe'.
      
      It also addresses the issues raised on the original thread
      (https://lkml.kernel.org/lkml/20200310223731.126894-1-ebiggers@kernel.org/T/#u)
      by documenting the modprobe sysctl, adding a self-test for the empty path
      case, and downgrading a user-reachable WARN_ONCE().
      
      This patch (of 4):
      
      It's long been possible to disable kernel module autoloading completely
      (while still allowing manual module insertion) by setting
      /proc/sys/kernel/modprobe to the empty string.  This can be preferable to
      setting it to a nonexistent file since it avoids the overhead of an
      attempted execve(), avoids potential deadlocks, and avoids the call to
      security_kernel_module_request() and thus on SELinux-based systems
      eliminates the need to write SELinux rules to dontaudit module_request.
      
      However, when module autoloading is disabled in this way, request_module()
      returns 0.  This is broken because callers expect 0 to mean that the
      module was successfully loaded.
      
      Apparently this was never noticed because this method of disabling module
      autoloading isn't used much, and also most callers don't use the return
      value of request_module() since it's always necessary to check whether the
      module registered its functionality or not anyway.  But improperly
      returning 0 can indeed confuse a few callers, for example get_fs_type() in
      fs/filesystems.c where it causes a WARNING to be hit:
      
      	if (!fs && (request_module("fs-%.*s", len, name) == 0)) {
      		fs = __get_fs_type(name, len);
      		WARN_ONCE(!fs, "request_module fs-%.*s succeeded, but still no fs?\n", len, name);
      	}
      
      This is easily reproduced with:
      
      	echo > /proc/sys/kernel/modprobe
      	mount -t NONEXISTENT none /
      
      It causes:
      
      	request_module fs-NONEXISTENT succeeded, but still no fs?
      	WARNING: CPU: 1 PID: 1106 at fs/filesystems.c:275 get_fs_type+0xd6/0xf0
      	[...]
      
      This should actually use pr_warn_once() rather than WARN_ONCE(), since
      it's also user-reachable if userspace immediately unloads the module.
      Regardless, request_module() should correctly return an error when it
      fails.  So let's make it return -ENOENT, which matches the error when the
      modprobe binary doesn't exist.
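
The intended behaviour can be modelled in a few lines of userspace C. This is a sketch under stated assumptions, not the kernel implementation: `modprobe_path_sketch` stands in for the kernel.modprobe sysctl value, `request_module_sketch()` is a hypothetical name, and the non-empty case simply pretends that the helper ran successfully.

```c
#include <errno.h>

/* Hypothetical stand-in for the value of /proc/sys/kernel/modprobe. */
static char modprobe_path_sketch[256] = "/sbin/modprobe";

/*
 * Model of the fixed return value: an empty helper path means that
 * autoloading is disabled, so report -ENOENT instead of 0.
 */
static int request_module_sketch(const char *module_name)
{
	(void)module_name;	/* unused in this simplified model */

	if (modprobe_path_sketch[0] == '\0')
		return -ENOENT;

	return 0;	/* assumption: the usermode helper ran and succeeded */
}
```

A caller written like the get_fs_type() excerpt above then skips its "succeeded" branch when autoloading is disabled, because the return value is no longer 0.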
      
      I've also sent patches to document and test this case.
      
      Link: http://lkml.kernel.org/r/20200310223731.126894-1-ebiggers@kernel.org
      Link: http://lkml.kernel.org/r/20200312202552.241885-1-ebiggers@kernel.org
      
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Acked-by: Luis Chamberlain <mcgrof@kernel.org>
      Reviewed-by: Jessica Yu <jeyu@kernel.org>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jeff Vander Stoep <jeffv@google.com>
      Cc: Ben Hutchings <benh@debian.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
    • mm/madvise: allow KSM hints for remote API · 5885ef0e
      Oleksandr Natalenko authored
      It all began with the fact that KSM works only on memory that is marked by
      madvise().  And the only way to get around that is to either:
      
        * use LD_PRELOAD; or
        * patch the kernel with something like UKSM or PKSM.
      
      (I intentionally skip the ptrace can of worms here.)

      To overcome this restriction, let's employ a new remote madvise API.  This
      can be used by a small userspace helper daemon that does the auto-KSM
      job for us.
      
      I think of two major consumers of remote KSM hints:
      
        * hosts, that run containers, especially similar ones and especially in
          a trusted environment, sharing the same runtime like Node.js;
      
        * heavy applications, that can be run in multiple instances, not
          limited to opensource ones like Firefox, but also those that cannot be
          modified since they are binary-only and, maybe, statically linked.
      
      Speaking of statistics, more numbers can be found in the very first
      submission related to this one [1].  For my current setup with two
      Firefox instances I get 100 to 200 MiB saved for the second instance,
      depending on the number of tabs.
      
      1 FF instance with 15 tabs:
      
         $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
         410
      
      2 FF instances, second one has 12 tabs (all the tabs are different):
      
         $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
         592
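
The shell arithmetic above converts a page count into MiB under the assumption of 4 KiB pages. The same conversion as a small C helper, where `pages_to_mib()` is a hypothetical name and the page size is passed in explicitly rather than hard-coded:

```c
/*
 * Convert a page count to MiB using integer arithmetic.
 * page_kib is the page size in KiB (4 in the commands above).
 */
static unsigned long pages_to_mib(unsigned long pages, unsigned long page_kib)
{
	return pages * page_kib / 1024;
}
```

The two readings quoted above, 410 and 592, differ by 182 MiB, which is consistent with the stated 100 to 200 MiB saving for the second instance.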
      
      At the very moment I do not have specific numbers for containerised
      workload, but those should be comparable in case the containers share
      similar/same runtime.
      
      [1] https://lore.kernel.org/patchwork/patch/1012142/
      
      Link: http://lkml.kernel.org/r/20200302193630.68771-8-minchan@kernel.org
      
      Signed-off-by: Oleksandr Natalenko <oleksandr@redhat.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: SeongJae Park <sjpark@amazon.de>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Christian Brauner <christian@brauner.io>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Dias <joaodias@google.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Sandeep Patil <sspatil@google.com>
      Cc: SeongJae Park <sj38.park@gmail.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <linux-man@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
    • mm/madvise: employ mmget_still_valid() for write lock · dfaef919
      Oleksandr Natalenko authored
      Do the very same trick as we already do since 04f5866e.  KSM hints
      will require locking mmap_sem for write since they modify vm_flags, so for
      remote KSM hinting this additional check is needed.
      
      Link: http://lkml.kernel.org/r/20200302193630.68771-7-minchan@kernel.org
      
      Signed-off-by: Oleksandr Natalenko <oleksandr@redhat.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Suren Baghdasaryan <surenb@google.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Christian Brauner <christian@brauner.io>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Dias <joaodias@google.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Sandeep Patil <sspatil@google.com>
      Cc: SeongJae Park <sj38.park@gmail.com>
      Cc: SeongJae Park <sjpark@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <linux-man@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
    • mm/madvise: support both pid and pidfd for process_madvise · 8730fa91
      Minchan Kim authored
      There is a demand[1] to support pid as well as pidfd for process_madvise,
      to avoid an unnecessary syscall to get a pidfd when the user has control
      of the target process (i.e., they can guarantee the process is not gone
      and the pid is not reused).
      
      This patch aims for supporting both options like waitid(2).  So, the
      syscall is currently,
      
      	int process_madvise(int which, pid_t pid, void *addr,
      		size_t length, int advise, unsigned long flag);
      
      @which is actually idtype_t for the userspace library, and it currently
      supports P_PID and P_PIDFD.
      
      [1]  https://lore.kernel.org/linux-mm/9d849087-3359-c4ab-fbec-859e8186c509@virtuozzo.com/
      
      Link: http://lkml.kernel.org/r/20200302193630.68771-6-minchan@kernel.org
      
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Suggested-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Reviewed-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christian Brauner <christian@brauner.io>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Dias <joaodias@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Sandeep Patil <sspatil@google.com>
      Cc: SeongJae Park <sj38.park@gmail.com>
      Cc: SeongJae Park <sjpark@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Cc: <linux-man@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
    • pid: move pidfd_get_pid() to pid.c · 94acb045
      Minchan Kim authored
      The process_madvise syscall needs the pidfd_get_pid() function to translate
      a pidfd to a pid, so this patch moves the function to kernel/pid.c.
      
      Link: http://lkml.kernel.org/r/20200302193630.68771-5-minchan@kernel.org
      
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Suren Baghdasaryan <surenb@google.com>
      Suggested-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jann Horn <jannh@google.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Dias <joaodias@google.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Sandeep Patil <sspatil@google.com>
      Cc: SeongJae Park <sj38.park@gmail.com>
      Cc: SeongJae Park <sjpark@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Cc: <linux-man@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
    • mm/madvise: check fatal signal pending of target process · 1adbfbfe
      Minchan Kim authored
      Bail out to prevent unnecessary CPU overhead if the target process has a
      pending fatal signal during a (MADV_COLD|MADV_PAGEOUT) operation.
      
      Link: http://lkml.kernel.org/r/20200302193630.68771-4-minchan@kernel.org
      
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Christian Brauner <christian@brauner.io>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Dias <joaodias@google.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Sandeep Patil <sspatil@google.com>
      Cc: SeongJae Park <sj38.park@gmail.com>
      Cc: SeongJae Park <sjpark@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Cc: <linux-man@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      1adbfbfe
    • Minchan Kim's avatar
      fix process_madvise build break for arm64 · 5a99edc0
      Minchan Kim authored
      The 0-day bot reported a build break from process_madvise on arm64.
      
         aarch64-linux-ld: arch/arm64/kernel/head.o: relocation R_AARCH64_ABS32 against `_kernel_offset_le_lo32' can not be used when making a shared object
         aarch64-linux-ld: arch/arm64/kernel/efi-entry.stub.o: relocation R_AARCH64_ABS32 against `__efistub_stext_offset' can not be used when making a shared object
         arch/arm64/kernel/head.o: In function `kimage_vaddr':
         (.idmap.text+0x0): dangerous relocation: unsupported relocation
         arch/arm64/kernel/head.o: In function `__primary_switch':
         (.idmap.text+0x378): dangerous relocation: unsupported relocation
         (.idmap.text+0x380): dangerous relocation: unsupported relocation
      >> arch/arm64/kernel/sys32.o:(.rodata+0xdb8): undefined reference to `__arm64_process_madvise'
      
      This patch should fix it.
      
      Link: http://lkml.kernel.org/r/20200303145756.GA219683@google.com
      
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reported-by: kbuild test robot <lkp@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      5a99edc0
    • Minchan Kim's avatar
      mm/madvise: introduce process_madvise() syscall: an external memory hinting API · 3b6306c6
      Minchan Kim authored
      There is a use case in which System Management Software (SMS) wants to give
      a memory hint such as MADV_[COLD|PAGEOUT] to other processes; on Android,
      that software is the ActivityManagerService.
      
      It's similar in spirit to madvise(MADV_DONTNEED), but the information
      required to make the reclaim decision is not known to the app.  Instead,
      it is known to the centralized userspace daemon(ActivityManagerService),
      and that daemon must be able to initiate reclaim on its own without any
      app involvement.
      
      To solve the issue, this patch introduces a new syscall
      process_madvise(2).  It uses pidfd of an external process to give the
      hint.
      
       int process_madvise(int pidfd, void *addr, size_t length, int advise,
      			unsigned long flag);
      
      Since it could affect another process's address range, only a privileged
      process (CAP_SYS_PTRACE) or one that otherwise has the right to ptrace the
      target (e.g., by running under the same UID) can use it successfully.  The
      flag argument is reserved for future use if we need to extend the API.
      
      I think supporting every hint that madvise supports, or will support, in
      process_madvise is rather risky.  We are not sure that all hints make sense
      when issued from an external process, and the implementation of a hint may
      rely on the caller being in the current context, so it could be
      error-prone.  Thus, I limited the hints to MADV_[COLD|PAGEOUT] in this patch.
      
      If someone wants to add other hints, we can hear the use case and review
      each hint individually.  That is safer for maintenance than introducing a
      buggy syscall that is hard to fix later.
      
      Q.1 - Why does any external entity have better knowledge?
      
      Quote from Sandeep
      
      "For Android, every application (including the special SystemServer) are
      forked from Zygote.  The reason of course is to share as many libraries
      and classes between the two as possible to benefit from the preloading
      during boot.
      
      After applications start, (almost) all of the APIs end up calling into
      this SystemServer process over IPC (binder) and back to the application.
      
      In a fully running system, the SystemServer monitors every single process
      periodically to calculate their PSS / RSS and also decides which process
      is "important" to the user for interactivity.
      
      So, because of how these processes start _and_ the fact that the
      SystemServer is looping to monitor each process, it does tend to *know*
      which address range of the application is not used / useful.
      
      Besides, we can never rely on applications to clean things up themselves.
      We've had the "hey app1, the system is low on memory, please trim your
      memory usage down" notifications for a long time[1].  They rely on
      applications honoring the broadcasts and very few do.
      
      So, if we want to avoid the inevitable killing of the application and
      restarting it, some way to be able to tell the OS about unimportant memory
      in these applications will be useful.
      
      - ssp
      
      Q.2 - How is the race (i.e., object validation) handled between an
      external process giving a hint and the target process receiving
      it?
      
      process_madvise operates on the target process's address space as it
      exists at the instant that process_madvise is called.  If the target
      process can run between the time the calling process inspects the
      target process address space and the time that process_madvise is actually
      called, process_madvise may operate on memory regions that the calling
      process does not expect.  It's the responsibility of the process calling
      process_madvise to close this race condition.  For example, the calling
      process can suspend the target process with ptrace, SIGSTOP, or the
      freezer cgroup so that it doesn't have an opportunity to change its own
      address space before process_madvise is called.  Another option is to
      operate on memory regions that the caller knows a priori will be unchanged
      in the target process.  Yet another option is to accept the race for
      certain process_madvise calls after reasoning that mistargeting will do no
      harm.  The suggested API itself does not provide synchronization.  The same
      applies to other APIs such as move_pages and process_vm_writev.
      
      The race isn't really a problem though.  Why is it so wrong to require
      that callers do their own synchronization in some manner?  Nobody objects
      to write(2) merely because it's possible for two processes to open the
      same file and clobber each other's writes --- instead, we tell people to
      use flock or something.  Think about mmap.  It never guarantees newly
      allocated address space is still valid when the user tries to access it
      because other threads could unmap the memory right before.  That's where
      we need synchronization, either through another API or through userspace
      design.  It shouldn't be part of the API itself.  If someone needs
      synchronization more fine-grained than process level, two ideas were
      suggested: cookie[2] and anon-fd[3].  Both could be implemented via the
      last reserved argument of the API, but I don't think either is necessary
      right now: we already have ways to prevent the race, so I don't want to
      add the complexity of a more fine-grained model.
      
      To keep the API extensible, an unsigned long is reserved as the last
      argument so we can support this in the future if someone really needs it.
      
      Q.3 - Why doesn't ptrace work?
      
      Injecting an madvise in the target process using ptrace would not work for
      us because such injected madvise would have to be executed by the target
      process, which means that process would have to be runnable and that
      creates the risk of the abovementioned race and hinting a wrong VMA.
      Furthermore, we want to act on the hint in the caller's context, not the
      callee's, because the callee is usually restricted by cpuset/cgroups or is
      even frozen, so it can't act quickly enough by itself, which causes
      more thrashing/kills.  Nor does ptrace work if the target process is already
      ptraced (e.g., by strace, a debugger, or minidump), because a process can
      have at most one ptracer.
      
      [1] https://developer.android.com/topic/performance/memory"
      
      [2] process_getinfo, for getting a cookie that is updated whenever the
          VMA layout of the process address space changes - Daniel Colascione -
          https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224
      
      [3] anonymous fd which is used for the object(i.e., address range)
          validation - Michal Hocko -
          https://lore.kernel.org/lkml/20200120112722.GY18451@dhcp22.suse.cz/
      
      Link: http://lkml.kernel.org/r/20200302193630.68771-3-minchan@kernel.org
      
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Christian Brauner <christian@brauner.io>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Dias <joaodias@google.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Sandeep Patil <sspatil@google.com>
      Cc: SeongJae Park <sj38.park@gmail.com>
      Cc: SeongJae Park <sjpark@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Cc: <linux-man@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      3b6306c6
    • Minchan Kim's avatar
      mm/madvise: pass task and mm to do_madvise · d0d65be6
      Minchan Kim authored
      Patch series "introduce memory hinting API for external process", v7.
      
      We now have MADV_PAGEOUT and MADV_COLD as madvise hints.  With them, an
      application can tell the kernel which memory ranges it prefers to have
      reclaimed.  However, on some platforms (e.g., Android), the
      information required to make the hinting decision is not known to the app.
      Instead, it is known to a centralized userspace daemon (e.g.,
      ActivityManagerService), and that daemon must be able to initiate reclaim
      on its own without any app involvement.
      
      To address this, this patch introduces a new syscall,
      process_madvise(2).  Basically, it is the same as the madvise(2) syscall,
      with some differences:
      
      1. It needs the pidfd of the target process to provide the hint.
      
      2. It supports only MADV_{COLD|PAGEOUT|MERGEABLE|UNMERGEABLE} at this
         moment.  Other madvise hints will be opened up when there are explicit
         requests from the community, to prevent unexpected bugs we couldn't support.
      
      3. Only privileged processes can act on another process's
         address space.
      
      For more detail on the new API, please see the "mm: introduce external
      memory hinting API" description in this patchset.
      
      This patch (of 7):
      
      In upcoming patches, do_madvise will be called from external process
      context so we shouldn't assume "current" is always the hinted process's
      task_struct.  Furthermore, we can't access mm_struct via task->mm once
      it's verified by access_mm, which will be introduced in the next patch[1].
      So let's pass *current* and current->mm as arguments of do_madvise; this
      shouldn't change existing behavior but prepares for the next patch and
      makes review easy.
      
      Note: io_madvise passes NULL as the target_task argument of do_madvise
      because it can't know who the target is.
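
      The shape of this refactor can be shown with a userspace mock.  The
      types and bodies below are stand-ins invented for illustration (the
      kernel's task_struct, mm_struct and do_madvise are far more involved);
      only the idea of passing the context explicitly is taken from the text.

```c
#include <stddef.h>

/* Stand-in types for illustration only. */
struct mm_struct {
	unsigned long hinted_bytes;
};
struct task_struct {
	struct mm_struct *mm;
};

static struct mm_struct current_mm;
static struct task_struct current_task = { .mm = &current_mm };
#define current (&current_task)		/* mock of the kernel's "current" */

/* After the refactor: the callee uses only the context it was handed. */
static int do_madvise(struct task_struct *target_task, struct mm_struct *mm,
		      unsigned long start, size_t len, int behavior)
{
	(void)target_task;		/* may be NULL, as for io_madvise */
	(void)start;
	(void)behavior;
	mm->hinted_bytes += len;	/* stand-in for the real hinting work */
	return 0;
}

/* The existing madvise(2) path keeps its behavior by passing current. */
static int madvise_self(unsigned long start, size_t len, int behavior)
{
	return do_madvise(current, current->mm, start, len, behavior);
}
```

      An external caller can now hand in a different mm without the callee
      ever consulting "current".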
      
      [1] http://lore.kernel.org/r/CAG48ez27=pwm5m_N_988xT1huO7g7h6arTQL44zev6TD-h-7Tg@mail.gmail.com
      
      Link: http://lkml.kernel.org/r/20200302193630.68771-2-minchan@kernel.org
      
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jann Horn <jannh@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Sandeep Patil <sspatil@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: John Dias <joaodias@google.com>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: SeongJae Park <sj38.park@gmail.com>
      Cc: Christian Brauner <christian@brauner.io>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: SeongJae Park <sjpark@amazon.de>
      Cc: <linux-man@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      d0d65be6
    • Logan Gunthorpe's avatar
      mm/memremap: set caching mode for PCI P2PDMA memory to UC · 5a352894
      Logan Gunthorpe authored
      PCI BAR IO memory should never be mapped as WB, however prior to this the
      PAT bits were set WB and it was typically overridden by MTRR registers set
      by the firmware.
      
      Set PCI P2PDMA memory to be UC as this is what it currently, typically,
      ends up being mapped as on x86 after the MTRR registers override the cache
      setting.
      
      Future use-cases may need to generalize this by adding flags to select the
      caching type, as some P2PDMA cases may not want UC.  However, those
      use-cases are not upstream yet and this can be changed when they arrive.
      
      Link: http://lkml.kernel.org/r/20200306170846.9333-8-logang@deltatee.com
      
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Eric Badger <ebadger@gigaio.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      5a352894
    • Logan Gunthorpe's avatar
      mm/memory_hotplug: add pgprot_t to mhp_params · b0c1a4f5
      Logan Gunthorpe authored
      devm_memremap_pages() is currently used by the PCI P2PDMA code to create
      struct page mappings for IO memory.  At present, these mappings are
      created with PAGE_KERNEL which implies setting the PAT bits to be WB.
      However, on x86, an MTRR register will typically override this and force
      the cache type to be UC-.  If the firmware doesn't set this register,
      the mapping is effectively WB and will typically result in a machine check
      exception when it's accessed.
      
      Other arches are not currently likely to function correctly seeing they
      don't have any MTRR registers to fall back on.
      
      To solve this, provide a way to specify the pgprot value explicitly to
      arch_add_memory().
      
      Of the arches that support MEMORY_HOTPLUG: x86_64 and arm64 need a simple
      change to pass the pgprot_t down to their respective functions which set
      up the page tables.  For x86_32, set the page tables explicitly using
      __set_memory_prot() (seeing they are already mapped).  For ia64, s390 and
      sh, reject anything but PAGE_KERNEL settings -- this should be fine, for
      now, seeing these architectures don't support ZONE_DEVICE.
      
      A check in __add_pages() is also added to ensure the pgprot parameter was
      set for all arches.
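
      The checks described above can be modelled in userspace.  This is a
      sketch under stated assumptions: pgprot_t and the PAGE_KERNEL values are
      mocks with illustrative bit patterns, and add_pages_check() and
      arch_add_memory_fixed() are hypothetical names standing in for the
      generic and per-arch code paths.

```c
#include <errno.h>
#include <stddef.h>

/* Stand-ins for kernel types; the bit values are illustrative only. */
typedef struct { unsigned long pgprot; } pgprot_t;
#define PAGE_KERNEL		((pgprot_t){ 0x163UL })	/* mock: cached */
#define PAGE_KERNEL_NOCACHE	((pgprot_t){ 0x17bUL })	/* mock: uncached */

struct vmem_altmap;

struct mhp_params {
	struct vmem_altmap *altmap;
	pgprot_t pgprot;	/* protection the new mapping must use */
};

/* Generic check: every caller must have filled in pgprot. */
static int add_pages_check(const struct mhp_params *params)
{
	return params->pgprot.pgprot ? 0 : -EINVAL;
}

/* Arches that cannot change the mapping accept PAGE_KERNEL only. */
static int arch_add_memory_fixed(const struct mhp_params *params)
{
	if (params->pgprot.pgprot != PAGE_KERNEL.pgprot)
		return -EINVAL;
	return add_pages_check(params);
}
```

      A caller such as the P2PDMA code would fill in an uncached pgprot, which
      the flexible arches honor and the fixed ones reject with -EINVAL.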
      
      Link: http://lkml.kernel.org/r/20200306170846.9333-7-logang@deltatee.com
      
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Eric Badger <ebadger@gigaio.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      b0c1a4f5
    • Logan Gunthorpe's avatar
      powerpc/mm: thread pgprot_t through create_section_mapping() · 6f6c7d0c
      Logan Gunthorpe authored
      In preparation to support a pgprot_t argument for arch_add_memory().
      
      Link: http://lkml.kernel.org/r/20200306170846.9333-6-logang@deltatee.com
      
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Eric Badger <ebadger@gigaio.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      6f6c7d0c
    • Logan Gunthorpe's avatar
      x86/mm: introduce __set_memory_prot() · 9b0249eb
      Logan Gunthorpe authored
      For use in the 32bit arch_add_memory() to set the pgprot type of the
      memory to add.
      
      Link: http://lkml.kernel.org/r/20200306170846.9333-5-logang@deltatee.com
      
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Eric Badger <ebadger@gigaio.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      9b0249eb
    • Logan Gunthorpe's avatar
      x86/mm: thread pgprot_t through init_memory_mapping() · a05025a0
      Logan Gunthorpe authored
      In preparation to support a pgprot_t argument for arch_add_memory().
      
      It's required to move the prototype of init_memory_mapping() seeing the
      original location came before the definition of pgprot_t.
      
      Link: http://lkml.kernel.org/r/20200306170846.9333-4-logang@deltatee.com
      
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Eric Badger <ebadger@gigaio.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      a05025a0
    • Logan Gunthorpe's avatar
      mm/memory_hotplug: rename mhp_restrictions to mhp_params · fe4a5878
      Logan Gunthorpe authored
      The mhp_restrictions struct really doesn't specify anything resembling a
      restriction anymore so rename it to be mhp_params as it is a list of
      extended parameters.
      
      Link: http://lkml.kernel.org/r/20200306170846.9333-3-logang@deltatee.com
      
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Eric Badger <ebadger@gigaio.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      fe4a5878
    • Logan Gunthorpe's avatar
      mm/memory_hotplug: drop the flags field from struct mhp_restrictions · dcba4ef2
      Logan Gunthorpe authored
      Patch series "Allow setting caching mode in arch_add_memory() for P2PDMA", v4.
      
      Currently, the page tables created using memremap_pages() are always
      created with the PAGE_KERNEL caching mode.  However, the P2PDMA code is
      creating pages for PCI BAR memory which should never be accessed through
      the cache and instead use either WC or UC.  This still works in most
      cases, on x86, because the MTRR registers typically override the caching
      settings in the page tables for all of the IO memory to be UC-.  However,
      this tends not to work so well on other arches or some rare x86 machines
      that have firmware which does not set up the MTRR registers in this way.
      
      Instead of this, this series proposes a change to arch_add_memory() to
      take the pgprot required by the mapping which allows us to explicitly set
      pagetable entries for P2PDMA memory to UC.
      
      This change is pretty routine for most of the arches: x86_64, arm64 and
      powerpc simply need to thread the pgprot through to where the page tables
      are set up.  x86_32 unfortunately sets up the page tables at boot so must
      use __set_memory_prot() to change their caching mode.  ia64, s390 and sh
      don't appear to have an easy way to change the page tables so, for now at
      least, we just return -EINVAL on such mappings and thus they will not
      support P2PDMA memory until the work for this is done.  This should be
      fine as they don't yet support ZONE_DEVICE.
      
      This patch (of 7):
      
      This variable is not used anywhere and should therefore be removed from
      the structure.
      
      Link: http://lkml.kernel.org/r/20200306170846.9333-2-logang@deltatee.com
      
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Eric Badger <ebadger@gigaio.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      dcba4ef2
    • Anshuman Khandual's avatar
      mm/debug: add tests validating architecture page table helpers · 0f0e1f35
      Anshuman Khandual authored
      This adds tests which will validate architecture page table helpers and
      other accessors in their compliance with expected generic MM semantics.
      This will help various architectures in validating changes to existing
      page table helpers or addition of new ones.
      
      This test covers basic page table entry transformations including but not
      limited to old, young, dirty, clean, write, write protect etc. at various
      levels, along with populating intermediate entries with the next page table
      page and validating them.
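
      The flavor of these checks can be sketched in userspace with a mock PTE.
      Everything here is an assumption made for illustration: the pte_t layout,
      the bit positions and the helper bodies are invented, and the function
      returns a violation count instead of emitting kernel warnings.

```c
#include <stdbool.h>

/* Mock PTE: real layouts and bit positions are architecture specific. */
typedef struct { unsigned long val; } pte_t;
#define PTE_YOUNG	(1UL << 0)
#define PTE_DIRTY	(1UL << 1)
#define PTE_WRITE	(1UL << 2)

static bool pte_young(pte_t p) { return p.val & PTE_YOUNG; }
static bool pte_dirty(pte_t p) { return p.val & PTE_DIRTY; }
static bool pte_write(pte_t p) { return p.val & PTE_WRITE; }

static pte_t pte_mkyoung(pte_t p)   { p.val |= PTE_YOUNG;  return p; }
static pte_t pte_mkold(pte_t p)     { p.val &= ~PTE_YOUNG; return p; }
static pte_t pte_mkdirty(pte_t p)   { p.val |= PTE_DIRTY;  return p; }
static pte_t pte_mkclean(pte_t p)   { p.val &= ~PTE_DIRTY; return p; }
static pte_t pte_mkwrite(pte_t p)   { p.val |= PTE_WRITE;  return p; }
static pte_t pte_wrprotect(pte_t p) { p.val &= ~PTE_WRITE; return p; }

/* Returns the number of violated expectations (0 means conforming). */
static int pte_basic_tests(pte_t pte)
{
	int bad = 0;

	bad += !pte_young(pte_mkyoung(pte_mkold(pte)));
	bad += pte_young(pte_mkold(pte_mkyoung(pte)));
	bad += !pte_dirty(pte_mkdirty(pte_mkclean(pte)));
	bad += pte_dirty(pte_mkclean(pte_mkdirty(pte)));
	bad += !pte_write(pte_mkwrite(pte_wrprotect(pte)));
	bad += pte_write(pte_wrprotect(pte_mkwrite(pte)));
	return bad;
}
```

      Each pair checks that a "make" helper and its inverse leave the entry in
      the state the generic MM code expects, whatever the starting value.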
      
      Test page table pages are allocated from system memory with required size
      and alignments.  The mapped pfns at page table levels are derived from a
      real pfn representing a valid kernel text symbol.  This test gets called
      inside kernel_init() right after async_synchronize_full().
      
      This test gets built and run when CONFIG_DEBUG_VM_PGTABLE is selected.
      Any architecture that wants to subscribe to this test will need to
      select ARCH_HAS_DEBUG_VM_PGTABLE.  For now this is limited to arc, arm64,
      x86, s390 and powerpc platforms where the test is known to build and run
      successfully.  Going forward, other architectures too can subscribe to the
      test after fixing any build or runtime problems with their page table helpers.
      Meanwhile for better platform coverage, the test can also be enabled with
      CONFIG_EXPERT even without ARCH_HAS_DEBUG_VM_PGTABLE.
      
      Folks interested in making sure that a given platform's page table helpers
      conform to expected generic MM semantics should enable the above config
      which will just trigger this test during boot.  Any nonconformity here
      will be reported as a warning which would need to be fixed.  This test
      will help catch any changes to the agreed upon semantics expected from
      generic MM and enable platforms to accommodate it thereafter.
      
      Link: http://lkml.kernel.org/r/1583919272-24178-1-git-send-email-anshuman.khandual@arm.com
      
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: Qian Cai <cai@lca.pw>
      Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
      Reviewed-by: Ingo Molnar <mingo@kernel.org>
      Tested-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>	# s390
      Tested-by: Christophe Leroy <christophe.leroy@c-s.fr>	# ppc32
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      0f0e1f35
    • Anshuman Khandual's avatar
      mm-special-create-generic-fallbacks-for-pte_special-and-pte_mkspecial-v3 · eb7562b1
      Anshuman Khandual authored
      use defined(CONFIG_ARCH_HAS_PTE_SPECIAL) in mips per Thomas
      
      Link: http://lkml.kernel.org/r/1583851924-21603-1-git-send-email-anshuman.khandual@arm.com
      
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      eb7562b1
    • Anshuman Khandual's avatar
      mm/special: create generic fallbacks for pte_special() and pte_mkspecial() · b58896ea
      Anshuman Khandual authored
      Currently there are many platforms that don't enable ARCH_HAS_PTE_SPECIAL
      but are required to define quite similar fallback stubs for special page
      table entry helpers such as pte_special() and pte_mkspecial(), as they get
      built in generic MM without a config check.  This creates two generic
      fallback stub definitions for these helpers, eliminating much code
      duplication.
      
      The mips platform has a special case where pte_special() and
      pte_mkspecial() visibility is wider than what ARCH_HAS_PTE_SPECIAL
      enablement requires.  This restricts the visibility of those symbols in
      order to avoid the redefinitions that are now exposed through these new
      generic stubs, and the subsequent build failure.  The arm platform
      set_pte_at() definition needs to be moved into a C file just to prevent a
      build failure.
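      
      The fallback pattern described above can be sketched in user-space C as
      follows.  This is a minimal illustration, not the kernel code: pte_t is a
      reduced stand-in type and check_pte_special_fallbacks() is a hypothetical
      helper added only to exercise the stubs.  It assumes the stubs report "not
      special" and return the entry unchanged.

```c
#include <assert.h>

/* Reduced stand-in for an architecture's page table entry type (assumption). */
typedef struct { unsigned long val; } pte_t;

/*
 * Generic fallbacks: compiled only when the architecture does not select
 * ARCH_HAS_PTE_SPECIAL and therefore provides no helpers of its own.
 */
#ifndef CONFIG_ARCH_HAS_PTE_SPECIAL
static inline int pte_special(pte_t pte)
{
	(void)pte;
	return 0;	/* no special bit exists, so no entry is special */
}

static inline pte_t pte_mkspecial(pte_t pte)
{
	return pte;	/* nothing to set; hand the entry back unchanged */
}
#endif

/* Hypothetical self-check showing what generic code can rely on. */
void check_pte_special_fallbacks(void)
{
	pte_t pte = { 0x1234UL };

	assert(!pte_special(pte));
	assert(pte_mkspecial(pte).val == pte.val);
}
```

      With stubs of this shape in one shared header, each platform that lacks
      the feature no longer needs its own copy.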
      
      Link: http://lkml.kernel.org/r/1583802551-15406-1-git-send-email-anshuman.khandual@arm.com
      
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Acked-by: Guo Ren <guoren@kernel.org>			[csky]
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>	[m68k]
      Acked-by: Stafford Horne <shorne@gmail.com>		[openrisc]
      Acked-by: Helge Deller <deller@gmx.de>			[parisc]
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Sam Creasey <sammy@sammy.net>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paul Burton <paulburton@kernel.org>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Vincent Chen <deanbo422@gmail.com>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      b58896ea
    • Anshuman Khandual's avatar
      mm/vma: introduce VM_ACCESS_FLAGS · 3d430ad1
      Anshuman Khandual authored
      There are many places where all basic VMA access flags (read, write, exec)
      are initialized or checked against as a group.  One such example is during
      page fault.  The existing vma_is_accessible() wrapper already creates the
      notion of VMA accessibility as a group of access permissions.  Hence let's
      just create VM_ACCESS_FLAGS (VM_READ|VM_WRITE|VM_EXEC), which will not only
      reduce code duplication but also extend the VMA accessibility concept in
      general.
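      
      As a rough user-space sketch of the grouping, assuming illustrative bit
      values and a reduced mock of struct vm_area_struct (vma_clear_access() is
      a hypothetical helper, not something the patch adds):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values for this sketch. */
#define VM_READ   0x00000001UL
#define VM_WRITE  0x00000002UL
#define VM_EXEC   0x00000004UL
#define VM_SHARED 0x00000008UL

/* The three basic access permissions, treated as one group. */
#define VM_ACCESS_FLAGS (VM_READ | VM_WRITE | VM_EXEC)

/* Reduced mock: only the flags word matters here. */
struct vm_area_struct {
	unsigned long vm_flags;
};

/* A VMA is accessible if at least one access permission is set. */
static inline bool vma_is_accessible(const struct vm_area_struct *vma)
{
	assert(vma);
	return (vma->vm_flags & VM_ACCESS_FLAGS) != 0;
}

/* Hypothetical helper: drop every access permission, keep other flags. */
static inline void vma_clear_access(struct vm_area_struct *vma)
{
	assert(vma);
	vma->vm_flags &= ~VM_ACCESS_FLAGS;
}
```

      Callers then test or mask one name instead of spelling out the three
      flags at every site.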
      
      Link: http://lkml.kernel.org/r/1583391014-8170-3-git-send-email-anshuman.khandual@arm.com
      
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Rob Springer <rspringer@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      3d430ad1
    • Anshuman Khandual's avatar
      mm/vma: define a default value for VM_DATA_DEFAULT_FLAGS · b277b074
      Anshuman Khandual authored
      There are many platforms with the exact same value for VM_DATA_DEFAULT_FLAGS.
      This creates a default value for VM_DATA_DEFAULT_FLAGS in line with the
      existing VM_STACK_DEFAULT_FLAGS.  While here, also define some more macros
      with standard VMA access flag combinations that are used frequently across
      many platforms.  Apart from simplification, this reduces code duplication
      as well.
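      
      The override-or-default arrangement can be sketched as below.  The bit
      values, the names of the combination macros, and the choice of which
      combination serves as the default are assumptions made for illustration;
      data_mapping_flags() is a hypothetical helper.

```c
#include <assert.h>

/* Illustrative bit values for this sketch. */
#define VM_READ     0x00000001UL
#define VM_WRITE    0x00000002UL
#define VM_EXEC     0x00000004UL
#define VM_MAYREAD  0x00000010UL
#define VM_MAYWRITE 0x00000020UL
#define VM_MAYEXEC  0x00000040UL

/* Frequently used access flag combinations (names are illustrative). */
#define VM_DATA_FLAGS_NON_EXEC \
	(VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)
#define VM_DATA_FLAGS_EXEC \
	(VM_DATA_FLAGS_NON_EXEC | VM_EXEC)

/*
 * A platform header included earlier may already define
 * VM_DATA_DEFAULT_FLAGS; only platforms that stay silent get the default.
 */
#ifndef VM_DATA_DEFAULT_FLAGS
#define VM_DATA_DEFAULT_FLAGS VM_DATA_FLAGS_EXEC
#endif

/* Hypothetical helper: initial flags for a new data mapping. */
static inline unsigned long data_mapping_flags(void)
{
	/* Whatever the platform picked, data must be readable and writable. */
	assert((VM_DATA_DEFAULT_FLAGS & (VM_READ | VM_WRITE)) ==
	       (VM_READ | VM_WRITE));
	return VM_DATA_DEFAULT_FLAGS;
}
```

      A platform that needs a different value defines the macro before this
      point, and the #ifndef guard leaves that definition alone.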
      
      Link: http://lkml.kernel.org/r/1583391014-8170-2-git-send-email-anshuman.khandual@arm.com
      
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paul Burton <paulburton@kernel.org>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Rich Felker <dalias@libc.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Chris Zankel <chris@zankel.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      b277b074
    • Andrew Morton's avatar
      net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy-fix · 23a88055
      Andrew Morton authored
      
      
      Cc: Arjun Roy <arjunroy@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Arjun Roy <arjunroy.kdev@gmail.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      23a88055
    • Arjun Roy's avatar
      net-zerocopy: use vm_insert_pages() for tcp rcv zerocopy · dba724db
      Arjun Roy authored
      Use vm_insert_pages() for tcp receive zerocopy.  Spin lock cycles (as
      reported by perf) drop from a couple of percentage points to a fraction of
      a percent.  This results in a roughly 6% increase in efficiency, measured
      roughly as zerocopy receive count divided by CPU utilization.
      
      The intention of this patchset is to reduce atomic ops for tcp zerocopy
      receives, which normally hit the same spinlock multiple times
      consecutively.
      
      Link: http://lkml.kernel.org/r/20200128025958.43490-3-arjunroy.kdev@gmail.com
      
      Signed-off-by: Arjun Roy <arjunroy@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      dba724db
    • Arjun Roy's avatar
      mm: vm_insert_pages() checks if pte_index defined. · 1e0a2954
      Arjun Roy authored
      pte_index() is either defined as a macro (e.g.  sparc64) or as an inlined
      function (e.g.  x86).  vm_insert_pages() depends on pte_index but it is
      not defined on all platforms (e.g.  m68k).
      
      To fix compilation of vm_insert_pages() on architectures not providing
      pte_index(), we perform the following fix:
      
      0. For platforms where it is meaningful, and defined as a macro, no
         change is needed.
      1. For platforms where it is meaningful and defined as an inlined
         function, and we want to use it with vm_insert_pages(), we define
         a degenerate macro of the form:  #define pte_index pte_index
      2. vm_insert_pages() checks for the existence of a pte_index macro
         definition. If found, it implements a batched insert. If not found,
         it devolves to calling vm_insert_page() in a loop.
      
      This patch implements step 2.
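      
      Steps 1 and 2 can be sketched in user-space C as follows.  This is only
      an illustration under stated assumptions: 4 KiB pages and 512 entries per
      page table, a mock struct page, a counter standing in for page table lock
      acquisitions, and hypothetical sketch_insert_page() and
      sketch_insert_pages() functions whose signatures are simplified and do
      not match the kernel's.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT   12UL		/* assumed 4 KiB pages */
#define PAGE_SIZE    (1UL << PAGE_SHIFT)
#define PTRS_PER_PTE 512UL		/* assumed entries per page table */

/* Step 1: an inline function made visible to the preprocessor. */
static inline unsigned long pte_index(unsigned long addr)
{
	return (addr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
}
#define pte_index pte_index

struct page { int unused; };		/* reduced mock */

static unsigned long locks_taken;	/* simulated lock acquisitions */

/* Hypothetical single-page insert: one lock round trip per page. */
static inline int sketch_insert_page(unsigned long addr, struct page *page)
{
	(void)addr;
	(void)page;
	locks_taken++;
	return 0;
}

/* Step 2: batch when pte_index is available, otherwise loop. */
static inline int sketch_insert_pages(unsigned long addr,
				      struct page **pages, size_t num)
{
	assert(pages != NULL);
#ifdef pte_index
	size_t done = 0;

	while (done < num) {
		/* Entries left in the page table covering this address. */
		size_t room = PTRS_PER_PTE - pte_index(addr + done * PAGE_SIZE);
		size_t batch = (num - done < room) ? num - done : room;

		locks_taken++;		/* one lock per page table, not per page */
		done += batch;
	}
	return 0;
#else
	for (size_t i = 0; i < num; i++) {
		int err = sketch_insert_page(addr + i * PAGE_SIZE, pages[i]);

		if (err)
			return err;
	}
	return 0;
#endif
}
```

      The self-referential #define expands to the function name itself, so
      calls still reach the inline function while #ifdef can detect it.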
      
      v3 of this patch fixes a compilation warning for an unused method.
      v2 of this patch moved a macro definition to a more readable location.
      
      Link: http://lkml.kernel.org/r/20200228054714.204424-2-arjunroy.kdev@gmail.com
      
      Signed-off-by: Arjun Roy <arjunroy@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      1e0a2954
    • Arjun Roy's avatar
      add missing page_count() check to vm_insert_pages(). · d34be3d5
      Arjun Roy authored
      Add missing page_count() check to vm_insert_pages(), specifically inside
      insert_page_in_batch_locked().  This was accidentally forgotten in the
      original patchset.
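      
      A minimal sketch of such a guard, assuming the check rejects a page whose
      reference count is zero with -EINVAL.  The struct page mock and
      sketch_insert_page_in_batch() are hypothetical stand-ins, not the
      kernel's definitions.

```c
#include <assert.h>
#include <errno.h>

/* Reduced mock: only the reference count matters here. */
struct page {
	int refcount;
};

static inline int page_count(const struct page *page)
{
	return page->refcount;
}

/*
 * Hypothetical stand-in for the per-page step of a batched insert.
 * *installed counts the pages that were actually mapped.
 */
static inline int sketch_insert_page_in_batch(struct page *page, int *installed)
{
	assert(page && installed);

	/* Assumed form of the check: refuse pages nobody holds a reference to. */
	if (!page_count(page))
		return -EINVAL;

	(*installed)++;
	return 0;
}
```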
      
      See: https://marc.info/?l=linux-mm&m=158156166403807&w=2
      
      The intention of this patch-set is to reduce atomic ops for tcp zerocopy
      receives, which normally hit the same spinlock multiple times
      consecutively.
      
      Link: http://lkml.kernel.org/r/20200214005929.104481-1-arjunroy.kdev@gmail.com
      
      Signed-off-by: Arjun Roy <arjunroy@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      d34be3d5