1. 07 Dec, 2017 1 commit
  2. 28 Nov, 2017 1 commit
    • Linus Torvalds's avatar
      proc: don't report kernel addresses in /proc/<pid>/stack · 8f5abe84
      Linus Torvalds authored
      This just changes the file to report them as zero, although maybe even
      that could be removed.  I checked, and at least procps doesn't actually
      seem to parse the 'stack' file at all.
      
      And since the file doesn't necessarily even exist (it requires
      CONFIG_STACKTRACE), possibly other tools don't really use it either.
      
      That said, in case somebody parses it with tools, just having that zero
      there should keep such tools happy.
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      8f5abe84
  3. 27 Nov, 2017 1 commit
    • Linus Torvalds's avatar
      Rename superblock flags (MS_xyz -> SB_xyz) · 1751e8a6
      Linus Torvalds authored
      This is a pure automated search-and-replace of the internal kernel
      superblock flags.
      
      The s_flags are now called SB_*, with the names and the values for the
      moment mirroring the MS_* flags that they're equivalent to.
      
      Note how the MS_xyz flags are the ones passed to the mount system call,
      while the SB_xyz flags are what we then use in sb->s_flags.
      
      The script to do this was:
      
          # places to look in; re security/*: it generally should *not* be
          # touched (that stuff parses mount(2) arguments directly), but
          # there are two places where we really deal with superblock flags.
          FILES="drivers/mtd drivers/staging/lustre fs ipc mm \
                  include/linux/fs.h include/uapi/linux/bfs_fs.h \
                  security/apparmor/apparmorfs.c security/apparmor/include/lib.h"
          # the list of MS_... constants
          SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \
                DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \
                POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \
                I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \
                ACTIVE NOUSER"
      
          SED_PROG=
          for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done
      
          # we want files that contain at least one of MS_...,
          # with fs/namespace.c and fs/pnode.c excluded.
          L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c')
      
          for f in $L; do sed -i $f $SED_PROG; done
      Requested-by: 's avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      1751e8a6
  4. 18 Nov, 2017 4 commits
  5. 16 Nov, 2017 3 commits
  6. 15 Nov, 2017 1 commit
    • Rafael J. Wysocki's avatar
      x86 / CPU: Always show current CPU frequency in /proc/cpuinfo · 7d5905dc
      Rafael J. Wysocki authored
      After commit 890da9cf (Revert "x86: do not use cpufreq_quick_get()
      for /proc/cpuinfo "cpu MHz"") the "cpu MHz" number in /proc/cpuinfo
      on x86 can be either the nominal CPU frequency (which is constant)
      or the frequency most recently requested by a scaling governor in
      cpufreq, depending on the cpufreq configuration.  That is somewhat
      inconsistent and is different from what it was before 4.13, so in
      order to restore the previous behavior, make it report the current
      CPU frequency like the scaling_cur_freq sysfs file in cpufreq.
      
      To that end, modify the /proc/cpuinfo implementation on x86 to use
      aperfmperf_snapshot_khz() to snapshot the APERF and MPERF feedback
      registers, if available, and use their values to compute the CPU
      frequency to be reported as "cpu MHz".
      
      However, do that carefully enough to avoid accumulating delays that
      lead to unacceptable access times for /proc/cpuinfo on systems with
      many CPUs.  Run aperfmperf_snapshot_khz() once on all CPUs
      asynchronously at the /proc/cpuinfo open time, add a single delay
      upfront (if necessary) at that point and simply compute the current
      frequency while running show_cpuinfo() for each individual CPU.
      
      Also, to avoid slowing down /proc/cpuinfo accesses too much, reduce
      the default delay between consecutive APERF and MPERF reads to 10 ms,
      which should be sufficient to get large enough numbers for the
      frequency computation in all cases.
      
      Fixes: 890da9cf (Revert "x86: do not use cpufreq_quick_get() for /proc/cpuinfo "cpu MHz"")
      Signed-off-by: 's avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: 's avatarThomas Gleixner <tglx@linutronix.de>
      Tested-by: 's avatarThomas Gleixner <tglx@linutronix.de>
      Acked-by: 's avatarIngo Molnar <mingo@kernel.org>
      7d5905dc
  7. 03 Nov, 2017 2 commits
  8. 02 Nov, 2017 1 commit
    • Greg Kroah-Hartman's avatar
      License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman authored
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information it it.
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information,
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side by side results from of the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      Criteria used to select files for SPDX license identifier tagging was:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there was new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      
      In initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: 's avatarKate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: 's avatarPhilippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: 's avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b2441318
  9. 25 Oct, 2017 1 commit
    • Mark Rutland's avatar
      locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns... · 6aa7de05
      Mark Rutland authored
      locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()
      
      Please do not apply this to mainline directly, instead please re-run the
      coccinelle script shown below and apply its output.
      
      For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
      preference to ACCESS_ONCE(), and new code is expected to use one of the
      former. So far, there's been no reason to change most existing uses of
      ACCESS_ONCE(), as these aren't harmful, and changing them results in
      churn.
      
      However, for some features, the read/write distinction is critical to
      correct operation. To distinguish these cases, separate read/write
      accessors must be used. This patch migrates (most) remaining
      ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
      coccinelle script:
      
      ----
      // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
      // WRITE_ONCE()
      
      // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
      
      virtual patch
      
      @ depends on patch @
      expression E1, E2;
      @@
      
      - ACCESS_ONCE(E1) = E2
      + WRITE_ONCE(E1, E2)
      
      @ depends on patch @
      expression E;
      @@
      
      - ACCESS_ONCE(E)
      + READ_ONCE(E)
      ----
      Signed-off-by: 's avatarMark Rutland <mark.rutland@arm.com>
      Signed-off-by: 's avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: davem@davemloft.net
      Cc: linux-arch@vger.kernel.org
      Cc: mpe@ellerman.id.au
      Cc: shuah@kernel.org
      Cc: snitzer@redhat.com
      Cc: thor.thayer@linux.intel.com
      Cc: tj@kernel.org
      Cc: viro@zeniv.linux.org.uk
      Cc: will.deacon@arm.com
      Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.comSigned-off-by: 's avatarIngo Molnar <mingo@kernel.org>
      6aa7de05
  10. 20 Oct, 2017 1 commit
    • nixiaoming's avatar
      tty fix oops when rmmod 8250 · c79dde62
      nixiaoming authored
      After rmmod 8250.ko
      tty_kref_put starts kwork (release_one_tty) to release proc interface
      oops when accessing driver->driver_name in proc_tty_unregister_driver
      
      Use jprobe, found driver->driver_name point to 8250.ko
      static static struct uart_driver serial8250_reg
      .driver_name= serial,
      
      Use name in proc_dir_entry instead of driver->driver_name to fix oops
      
      test on linux 4.1.12:
      
      BUG: unable to handle kernel paging request at ffffffffa01979de
      IP: [<ffffffff81310f40>] strchr+0x0/0x30
      PGD 1a0d067 PUD 1a0e063 PMD 851c1f067 PTE 0
      Oops: 0000 [#1] PREEMPT SMP
      Modules linked in: ... ...  [last unloaded: 8250]
      CPU: 7 PID: 116 Comm: kworker/7:1 Tainted: G           O    4.1.12 #1
      Hardware name: Insyde RiverForest/Type2 - Board Product Name1, BIOS NE5KV904 12/21/2015
      Workqueue: events release_one_tty
      task: ffff88085b684960 ti: ffff880852884000 task.ti: ffff880852884000
      RIP: 0010:[<ffffffff81310f40>]  [<ffffffff81310f40>] strchr+0x0/0x30
      RSP: 0018:ffff880852887c90  EFLAGS: 00010282
      RAX: ffffffff81a5eca0 RBX: ffffffffa01979de RCX: 0000000000000004
      RDX: ffff880852887d10 RSI: 000000000000002f RDI: ffffffffa01979de
      RBP: ffff880852887cd8 R08: 0000000000000000 R09: ffff88085f5d94d0
      R10: 0000000000000195 R11: 0000000000000000 R12: ffffffffa01979de
      R13: ffff880852887d00 R14: ffffffffa01979de R15: ffff88085f02e840
      FS:  0000000000000000(0000) GS:ffff88085f5c0000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: ffffffffa01979de CR3: 0000000001a0c000 CR4: 00000000001406e0
      Stack:
       ffffffff812349b1 ffff880852887cb8 ffff880852887d10 ffff88085f5cd6c2
       ffff880852800a80 ffffffffa01979de ffff880852800a84 0000000000000010
       ffff88085bb28bd8 ffff880852887d38 ffffffff812354f0 ffff880852887d08
      Call Trace:
       [<ffffffff812349b1>] ? __xlate_proc_name+0x71/0xd0
       [<ffffffff812354f0>] remove_proc_entry+0x40/0x180
       [<ffffffff815f6811>] ? _raw_spin_lock_irqsave+0x41/0x60
       [<ffffffff813be520>] ? destruct_tty_driver+0x60/0xe0
       [<ffffffff81237c68>] proc_tty_unregister_driver+0x28/0x40
       [<ffffffff813be548>] destruct_tty_driver+0x88/0xe0
       [<ffffffff813be5bd>] tty_driver_kref_put+0x1d/0x20
       [<ffffffff813becca>] release_one_tty+0x5a/0xd0
       [<ffffffff81074159>] process_one_work+0x139/0x420
       [<ffffffff810745a1>] worker_thread+0x121/0x450
       [<ffffffff81074480>] ? process_scheduled_works+0x40/0x40
       [<ffffffff8107a16c>] kthread+0xec/0x110
       [<ffffffff81080000>] ? tg_rt_schedulable+0x210/0x220
       [<ffffffff8107a080>] ? kthread_freezable_should_stop+0x80/0x80
       [<ffffffff815f7292>] ret_from_fork+0x42/0x70
       [<ffffffff8107a080>] ? kthread_freezable_should_stop+0x80/0x80
      Signed-off-by: 's avatarnixiaoming <nixiaoming@huawei.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c79dde62
  11. 10 Oct, 2017 1 commit
  12. 30 Sep, 2017 1 commit
  13. 29 Sep, 2017 3 commits
  14. 15 Sep, 2017 1 commit
    • John Ogness's avatar
      fs/proc: Report eip/esp in /prod/PID/stat for coredumping · fd7d5627
      John Ogness authored
      Commit 0a1eb2d4 ("fs/proc: Stop reporting eip and esp in
      /proc/PID/stat") stopped reporting eip/esp because it is
      racy and dangerous for executing tasks. The comment adds:
      
          As far as I know, there are no use programs that make any
          material use of these fields, so just get rid of them.
      
      However, existing userspace core-dump-handler applications (for
      example, minicoredumper) are using these fields since they
      provide an excellent cross-platform interface to these valuable
      pointers. So that commit introduced a user space visible
      regression.
      
      Partially revert the change and make the readout possible for
      tasks with the proper permissions and only if the target task
      has the PF_DUMPCORE flag set.
      
      Fixes: 0a1eb2d4 ("fs/proc: Stop reporting eip and esp in> /proc/PID/stat")
      Reported-by: 's avatarMarco Felsch <marco.felsch@preh.de>
      Signed-off-by: 's avatarJohn Ogness <john.ogness@linutronix.de>
      Reviewed-by: 's avatarAndy Lutomirski <luto@kernel.org>
      Cc: Tycho Andersen <tycho.andersen@canonical.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: stable@vger.kernel.org
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Linux API <linux-api@vger.kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/87poatfwg6.fsf@linutronix.deSigned-off-by: 's avatarThomas Gleixner <tglx@linutronix.de>
      fd7d5627
  15. 14 Sep, 2017 2 commits
    • Michal Hocko's avatar
      mm: treewide: remove GFP_TEMPORARY allocation flag · 0ee931c4
      Michal Hocko authored
      GFP_TEMPORARY was introduced by commit e12ba74d ("Group short-lived
      and reclaimable kernel allocations") along with __GFP_RECLAIMABLE.  It's
      primary motivation was to allow users to tell that an allocation is
      short lived and so the allocator can try to place such allocations close
      together and prevent long term fragmentation.  As much as this sounds
      like a reasonable semantic it becomes much less clear when to use the
      highlevel GFP_TEMPORARY allocation flag.  How long is temporary? Can the
      context holding that memory sleep? Can it take locks? It seems there is
      no good answer for those questions.
      
      The current implementation of GFP_TEMPORARY is basically GFP_KERNEL |
      __GFP_RECLAIMABLE which in itself is tricky because basically none of
      the existing caller provide a way to reclaim the allocated memory.  So
      this is rather misleading and hard to evaluate for any benefits.
      
      I have checked some random users and none of them has added the flag
      with a specific justification.  I suspect most of them just copied from
      other existing users and others just thought it might be a good idea to
      use without any measuring.  This suggests that GFP_TEMPORARY just
      motivates for cargo cult usage without any reasoning.
      
      I believe that our gfp flags are quite complex already and especially
      those with highlevel semantic should be clearly defined to prevent from
      confusion and abuse.  Therefore I propose dropping GFP_TEMPORARY and
      replace all existing users to simply use GFP_KERNEL.  Please note that
      SLAB users with shrinkers will still get __GFP_RECLAIMABLE heuristic and
      so they will be placed properly for memory fragmentation prevention.
      
      I can see reasons we might want some gfp flag to reflect shorterm
      allocations but I propose starting from a clear semantic definition and
      only then add users with proper justification.
      
      This was been brought up before LSF this year by Matthew [1] and it
      turned out that GFP_TEMPORARY really doesn't have a clear semantic.  It
      seems to be a heuristic without any measured advantage for most (if not
      all) its current users.  The follow up discussion has revealed that
      opinions on what might be temporary allocation differ a lot between
      developers.  So rather than trying to tweak existing users into a
      semantic which they haven't expected I propose to simply remove the flag
      and start from scratch if we really need a semantic for short term
      allocations.
      
      [1] http://lkml.kernel.org/r/20170118054945.GD18349@bombadil.infradead.org
      
      [akpm@linux-foundation.org: fix typo]
      [akpm@linux-foundation.org: coding-style fixes]
      [sfr@canb.auug.org.au: drm/i915: fix up]
        Link: http://lkml.kernel.org/r/20170816144703.378d4f4d@canb.auug.org.au
      Link: http://lkml.kernel.org/r/20170728091904.14627-1-mhocko@kernel.orgSigned-off-by: 's avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: 's avatarStephen Rothwell <sfr@canb.auug.org.au>
      Acked-by: 's avatarMel Gorman <mgorman@suse.de>
      Acked-by: 's avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      0ee931c4
    • Arnd Bergmann's avatar
      procfs: remove unused variable · 6dec0dd4
      Arnd Bergmann authored
      In NOMMU configurations, we get a warning about a variable that has become
      unused:
      
        fs/proc/task_nommu.c: In function 'nommu_vma_show':
        fs/proc/task_nommu.c:148:28: error: unused variable 'priv' [-Werror=unused-variable]
      
      Link: http://lkml.kernel.org/r/20170911200231.3171415-1-arnd@arndb.de
      Fixes: 1240ea0d ("fs, proc: remove priv argument from is_stack")
      Signed-off-by: 's avatarArnd Bergmann <arnd@arndb.de>
      Acked-by: 's avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      6dec0dd4
  16. 09 Sep, 2017 8 commits
  17. 07 Sep, 2017 3 commits
    • Rik van Riel's avatar
      mm,fork: introduce MADV_WIPEONFORK · d2cd9ede
      Rik van Riel authored
      Introduce MADV_WIPEONFORK semantics, which result in a VMA being empty
      in the child process after fork.  This differs from MADV_DONTFORK in one
      important way.
      
      If a child process accesses memory that was MADV_WIPEONFORK, it will get
      zeroes.  The address ranges are still valid, they are just empty.
      
      If a child process accesses memory that was MADV_DONTFORK, it will get a
      segmentation fault, since those address ranges are no longer valid in
      the child after fork.
      
      Since MADV_DONTFORK also seems to be used to allow very large programs
      to fork in systems with strict memory overcommit restrictions, changing
      the semantics of MADV_DONTFORK might break existing programs.
      
      MADV_WIPEONFORK only works on private, anonymous VMAs.
      
      The use case is libraries that store or cache information, and want to
      know that they need to regenerate it in the child process after fork.
      
      Examples of this would be:
       - systemd/pulseaudio API checks (fail after fork) (replacing a getpid
         check, which is too slow without a PID cache)
       - PKCS#11 API reinitialization check (mandated by specification)
       - glibc's upcoming PRNG (reseed after fork)
       - OpenSSL PRNG (reseed after fork)
      
      The security benefits of a forking server having a re-inialized PRNG in
      every child process are pretty obvious.  However, due to libraries
      having all kinds of internal state, and programs getting compiled with
      many different versions of each library, it is unreasonable to expect
      calling programs to re-initialize everything manually after fork.
      
      A further complication is the proliferation of clone flags, programs
      bypassing glibc's functions to call clone directly, and programs calling
      unshare, causing the glibc pthread_atfork hook to not get called.
      
      It would be better to have the kernel take care of this automatically.
      
      The patch also adds MADV_KEEPONFORK, to undo the effects of a prior
      MADV_WIPEONFORK.
      
      This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:
      
          https://man.openbsd.org/minherit.2
      
      [akpm@linux-foundation.org: numerically order arch/parisc/include/uapi/asm/mman.h #defines]
      Link: http://lkml.kernel.org/r/20170811212829.29186-3-riel@redhat.comSigned-off-by: 's avatarRik van Riel <riel@redhat.com>
      Reported-by: 's avatarFlorian Weimer <fweimer@redhat.com>
      Reported-by: 's avatarColm MacCártaigh <colm@allcosts.net>
      Reviewed-by: 's avatarMike Kravetz <mike.kravetz@oracle.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Drewry <wad@chromium.org>
      Cc: <linux-api@vger.kernel.org>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      d2cd9ede
    • Daniel Colascione's avatar
      mm: add /proc/pid/smaps_rollup · 493b0e9d
      Daniel Colascione authored
      /proc/pid/smaps_rollup is a new proc file that improves the performance
      of user programs that determine aggregate memory statistics (e.g., total
      PSS) of a process.
      
      Android regularly "samples" the memory usage of various processes in
      order to balance its memory pool sizes.  This sampling process involves
      opening /proc/pid/smaps and summing certain fields.  For very large
      processes, sampling memory use this way can take several hundred
      milliseconds, due mostly to the overhead of the seq_printf calls in
      task_mmu.c.
      
      smaps_rollup improves the situation.  It contains most of the fields of
      /proc/pid/smaps, but instead of a set of fields for each VMA,
      smaps_rollup instead contains one synthetic smaps-format entry
      representing the whole process.  In the single smaps_rollup synthetic
      entry, each field is the summation of the corresponding field in all of
      the real-smaps VMAs.  Using a common format for smaps_rollup and smaps
      allows userspace parsers to repurpose parsers meant for use with
      non-rollup smaps for smaps_rollup, and it allows userspace to switch
      between smaps_rollup and smaps at runtime (say, based on the
      availability of smaps_rollup in a given kernel) with minimal fuss.
      
      By using smaps_rollup instead of smaps, a caller can avoid the
      significant overhead of formatting, reading, and parsing each of a large
      process's potentially very numerous memory mappings.  For sampling
      system_server's PSS in Android, we measured a 12x speedup, representing
      a savings of several hundred milliseconds.
      
      One alternative to a new per-process proc file would have been including
      PSS information in /proc/pid/status.  We considered this option but
      thought that PSS would be too expensive (by a few orders of magnitude)
      to collect relative to what's already emitted as part of
      /proc/pid/status, and slowing every user of /proc/pid/status for the
      sake of readers that happen to want PSS feels wrong.
      
      The code itself works by reusing the existing VMA-walking framework we
      use for regular smaps generation and keeping the mem_size_stats
      structure around between VMA walks instead of using a fresh one for each
      VMA.  In this way, summation happens automatically.  We let seq_file
      walk over the VMAs just as it does for regular smaps and just emit
      nothing to the seq_file until we hit the last VMA.
      
      Benchmarks:
      
          using smaps:
          iterations:1000 pid:1163 pss:220023808
          0m29.46s real 0m08.28s user 0m20.98s system
      
          using smaps_rollup:
          iterations:1000 pid:1163 pss:220702720
          0m04.39s real 0m00.03s user 0m04.31s system
      
      We're using the PSS samples we collect asynchronously for
      system-management tasks like fine-tuning oom_adj_score, memory use
      tracking for debugging, application-level memory-use attribution, and
      deciding whether we want to kill large processes during system idle
      maintenance windows.  Android has been using PSS for these purposes for
      a long time; as the average process VMA count has increased and and
      devices become more efficiency-conscious, PSS-collection inefficiency
      has started to matter more.  IMHO, it'd be a lot safer to optimize the
      existing PSS-collection model, which has been fine-tuned over the years,
      instead of changing the memory tracking approach entirely to work around
      smaps-generation inefficiency.
      
      Tim said:
      
      : There are two main reasons why Android gathers PSS information:
      :
      : 1. Android devices can show the user the amount of memory used per
      :    application via the settings app.  This is a less important use case.
      :
      : 2. We log PSS to help identify leaks in applications.  We have found
      :    an enormous number of bugs (in the Android platform, in Google's own
      :    apps, and in third-party applications) using this data.
      :
      : To do this, system_server (the main process in Android userspace) will
      : sample the PSS of a process three seconds after it changes state (for
      : example, app is launched and becomes the foreground application) and about
      : every ten minutes after that.  The net result is that PSS collection is
      : regularly running on at least one process in the system (usually a few
      : times a minute while the screen is on, less when screen is off due to
      : suspend).  PSS of a process is an incredibly useful stat to track, and we
      : aren't going to get rid of it.  We've looked at some very hacky approaches
      : using RSS ("take the RSS of the target process, subtract the RSS of the
      : zygote process that is the parent of all Android apps") to reduce the
      : accounting time, but it regularly overestimated the memory used by 20+
      : percent.  Accordingly, I don't think that there's a good alternative to
      : using PSS.
      :
      : We started looking into PSS collection performance after we noticed random
      : frequency spikes while a phone's screen was off; occasionally, one of the
      : CPU clusters would ramp to a high frequency because there was 200-300ms of
      : constant CPU work from a single thread in the main Android userspace
      : process.  The work causing the spike (which is reasonable governor
      : behavior given the amount of CPU time needed) was always PSS collection.
      : As a result, Android is burning more power than we should be on PSS
      : collection.
      :
      : The other issue (and why I'm less sure about improving smaps as a
      : long-term solution) is that the number of VMAs per process has increased
      : significantly from release to release.  After trying to figure out why we
      : were seeing these 200-300ms PSS collection times on Android O but had not
      : noticed it in previous versions, we found that the number of VMAs in the
      : main system process increased by 50% from Android N to Android O (from
      : ~1800 to ~2700) and varying increases in every userspace process.  Android
      : M to N also had an increase in the number of VMAs, although not as much.
      : I'm not sure why this is increasing so much over time, but thinking about
      : ASLR and ways to make ASLR better, I expect that this will continue to
      : increase going forward.  I would not be surprised if we hit 5000 VMAs on
      : the main Android process (system_server) by 2020.
      :
      : If we assume that the number of VMAs is going to increase over time, then
      : doing anything we can do to reduce the overhead of each VMA during PSS
      : collection seems like the right way to go, and that means outputting an
      : aggregate statistic (to avoid whatever overhead there is per line in
      : writing smaps and in reading each line from userspace).
      
      Link: http://lkml.kernel.org/r/20170812022148.178293-1-dancol@google.comSigned-off-by: 's avatarDaniel Colascione <dancol@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sonny Rao <sonnyrao@chromium.org>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      493b0e9d
    • Michal Hocko's avatar
      mm: rename global_page_state to global_zone_page_state · c41f012a
      Michal Hocko authored
      global_page_state is error prone as a recent bug report pointed out [1].
      It only returns proper values for zone based counters as the enum it
      gets suggests.  We already have global_node_page_state so let's rename
      global_page_state to global_zone_page_state to be more explicit here.
      All existing users seems to be correct:
      
      $ git grep "global_page_state(NR_" | sed 's@.*(\(NR_[A-Z_]*\)).*@\1@' | sort | uniq -c
            2 NR_BOUNCE
            2 NR_FREE_CMA_PAGES
           11 NR_FREE_PAGES
            1 NR_KERNEL_STACK_KB
            1 NR_MLOCK
            2 NR_PAGETABLE
      
      This patch shouldn't introduce any functional change.
      
      [1] http://lkml.kernel.org/r/201707260628.v6Q6SmaS030814@www262.sakura.ne.jp
      
      Link: http://lkml.kernel.org/r/20170801134256.5400-2-hannes@cmpxchg.orgSigned-off-by: 's avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: 's avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      c41f012a
  18. 10 Aug, 2017 3 commits
    • Minchan Kim's avatar
      mm: fix KSM data corruption · b3a81d08
      Minchan Kim authored
      Nadav reported KSM can corrupt the user data by the TLB batching
      race[1].  That means data user written can be lost.
      
      Quote from Nadav Amit:
       "For this race we need 4 CPUs:
      
        CPU0: Caches a writable and dirty PTE entry, and uses the stale value
        for write later.
      
        CPU1: Runs madvise_free on the range that includes the PTE. It would
        clear the dirty-bit. It batches TLB flushes.
      
        CPU2: Writes 4 to /proc/PID/clear_refs , clearing the PTEs soft-dirty.
        We care about the fact that it clears the PTE write-bit, and of
        course, batches TLB flushes.
      
        CPU3: Runs KSM. Our purpose is to pass the following test in
        write_protect_page():
      
      	if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
      	    (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)))
      
        Since it will avoid TLB flush. And we want to do it while the PTE is
        stale. Later, and before replacing the page, we would be able to
        change the page.
      
        Note that all the operations the CPU1-3 perform canhappen in parallel
        since they only acquire mmap_sem for read.
      
        We start with two identical pages. Everything below regards the same
        page/PTE.
      
        CPU0        CPU1        CPU2        CPU3
        ----        ----        ----        ----
        Write the same
        value on page
      
        [cache PTE as
         dirty in TLB]
      
                    MADV_FREE
                    pte_mkclean()
      
                                4 > clear_refs
                                pte_wrprotect()
      
                                            write_protect_page()
                                            [ success, no flush ]
      
                                            pages_indentical()
                                            [ ok ]
      
        Write to page
        different value
      
        [Ok, using stale
         PTE]
      
                                            replace_page()
      
        Later, CPU1, CPU2 and CPU3 would flush the TLB, but that is too late.
        CPU0 already wrote on the page, but KSM ignored this write, and it got
        lost"
      
      In above scenario, MADV_FREE is fixed by changing TLB batching API
      including [set|clear]_tlb_flush_pending.  Remained thing is soft-dirty
      part.
      
      This patch changes soft-dirty uses TLB batching API instead of
      flush_tlb_mm and KSM checks pending TLB flush by using
      mm_tlb_flush_pending so that it will flush TLB to avoid data lost if
      there are other parallel threads pending TLB flush.
      
      [1] http://lkml.kernel.org/r/BD3A0EBE-ECF4-41D4-87FA-C755EA9AB6BD@gmail.com
      
      Link: http://lkml.kernel.org/r/20170802000818.4760-8-namit@vmware.comSigned-off-by: 's avatarMinchan Kim <minchan@kernel.org>
      Signed-off-by: 's avatarNadav Amit <namit@vmware.com>
      Reported-by: 's avatarNadav Amit <namit@vmware.com>
      Tested-by: 's avatarNadav Amit <namit@vmware.com>
      Reviewed-by: 's avatarAndrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      b3a81d08
    • Johannes Weiner's avatar
      mm: fix global NR_SLAB_.*CLAIMABLE counter reads · d507e2eb
      Johannes Weiner authored
      As Tetsuo points out:
       "Commit 385386cf ("mm: vmstat: move slab statistics from zone to
        node counters") broke "Slab:" field of /proc/meminfo . It shows nearly
        0kB"
      
      In addition to /proc/meminfo, this problem also affects the slab
      counters OOM/allocation failure info dumps, can cause early -ENOMEM from
      overcommit protection, and miscalculate image size requirements during
      suspend-to-disk.
      
      This is because the patch in question switched the slab counters from
      the zone level to the node level, but forgot to update the global
      accessor functions to read the aggregate node data instead of the
      aggregate zone data.
      
      Use global_node_page_state() to access the global slab counters.
      
      Fixes: 385386cf ("mm: vmstat: move slab statistics from zone to node counters")
      Link: http://lkml.kernel.org/r/20170801134256.5400-1-hannes@cmpxchg.orgSigned-off-by: 's avatarJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: 's avatarTetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Acked-by: 's avatarMichal Hocko <mhocko@suse.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Stefan Agner <stefan@agner.ch>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      d507e2eb
    • Aleksa Sarai's avatar
      sched/debug: Use task_pid_nr_ns in /proc/$pid/sched · 74dc3384
      Aleksa Sarai authored
      It appears as though the addition of the PID namespace did not update
      the output code for /proc/*/sched, which resulted in it providing PIDs
      that were not self-consistent with the /proc mount. This additionally
      made it trivial to detect whether a process was inside &init_pid_ns from
      userspace, making container detection trivial:
      
         https://github.com/jessfraz/amicontained
      
      This leads to situations such as:
      
        % unshare -pmf
        % mount -t proc proc /proc
        % head -n1 /proc/1/sched
        head (10047, #threads: 1)
      
      Fix this by just using task_pid_nr_ns for the output of /proc/*/sched.
      All of the other uses of task_pid_nr in kernel/sched/debug.c are from a
      sysctl context and thus don't need to be namespaced.
      Signed-off-by: 's avatarAleksa Sarai <asarai@suse.com>
      Signed-off-by: 's avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: 's avatarEric W. Biederman <ebiederm@xmission.com>
      Cc: Jess Frazelle <acidburn@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: cyphar@cyphar.com
      Link: http://lkml.kernel.org/r/20170806044141.5093-1-asarai@suse.comSigned-off-by: 's avatarIngo Molnar <mingo@kernel.org>
      74dc3384
  19. 17 Jul, 2017 2 commits
    • Logan Gunthorpe's avatar
      block: order /proc/devices by major number · 133d55cd
      Logan Gunthorpe authored
      Presently, the order of the block devices listed in /proc/devices is not
      entirely sequential. If a block device has a major number greater than
      BLKDEV_MAJOR_HASH_SIZE (255), it will be ordered as if its major were
      module 255. For example, 511 appears after 1.
      
      This patch cleans that up and prints each major number in the correct
      order, regardless of where they are stored in the hash table.
      
      In order to do this, we introduce BLKDEV_MAJOR_MAX as an artificial
      limit (chosen to be 512). It will then print all devices in major
      order number from 0 to the maximum.
      Signed-off-by: 's avatarLogan Gunthorpe <logang@deltatee.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      133d55cd
    • Logan Gunthorpe's avatar
      char_dev: order /proc/devices by major number · 8a932f73
      Logan Gunthorpe authored
      Presently, the order of the char devices listed in /proc/devices is not
      entirely sequential. If a char device has a major number greater than
      CHRDEV_MAJOR_HASH_SIZE (255), it will be ordered as if its major were
      module 255. For example, 511 appears after 1.
      
      This patch cleans that up and prints each major number in the correct
      order, regardless of where they are stored in the hash table.
      
      In order to do this, we introduce CHRDEV_MAJOR_MAX as an artificial
      limit (chosen to be 511). It will then print all devices in major
      order number from 0 to the maximum.
      Signed-off-by: 's avatarLogan Gunthorpe <logang@deltatee.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Alan Cox <alan@linux.intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Linus Walleij <linus.walleij@linaro.org>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8a932f73