Skip to content
Snippets Groups Projects
  1. Jan 23, 2019
    • Linus Torvalds's avatar
      Revert "Change mincore() to count "mapped" pages rather than "cached" pages" · 30bac164
      Linus Torvalds authored
      
      This reverts commit 574823bf.
      
      It turns out that my hope that we could just remove the code that
      exposes the cache residency status from mincore() was too optimistic.
      
      There are various random users that want it, and one example would be
      the Netflix database cluster maintenance. To quote Josh Snyder:
      
       "For Netflix, losing accurate information from the mincore syscall
        would lengthen database cluster maintenance operations from days to
        months. We rely on cross-process mincore to migrate the contents of a
        page cache from machine to machine, and across reboots.
      
        To do this, I wrote and maintain happycache [1], a page cache
        dumper/loader tool. It is quite similar in architecture to pgfincore,
        except that it is agnostic to workload. The gist of happycache's
        operation is "produce a dump of residence status for each page, do
        some operation, then reload exactly the same pages which were present
        before." happycache is entirely dependent on accurate reporting of the
        in-core status of file-backed pages, as accessed by another process.
      
        We primarily use happycache with Cassandra, which (like Postgres +
        pgfincore) relies heavily on OS page cache to reduce disk accesses.
        Because our workloads never experience a cold page cache, we are able
        to provision hardware for a peak utilization level that is far lower
        than the hypothetical "every query is a cache miss" peak.
      
        A database warmed by happycache can be ready for service in seconds
        (bounded only by the performance of the drives and the I/O subsystem),
        with no period of in-service degradation. By contrast, putting a
        database in service without a page cache entails a potentially
        unbounded period of degradation (at Netflix, the time to populate a
        single node's cache via natural cache misses varies by workload from
        hours to weeks). If a single node upgrade were to take weeks, then
        upgrading an entire cluster would take months. Since we want to apply
        security upgrades (and other things) on a somewhat tighter schedule,
        we would have to develop more complex solutions to provide the same
        functionality already provided by mincore.
      
        At the bottom line, happycache is designed to benignly exploit the
        same information leak documented in the paper [2]. I think it makes
        perfect sense to remove cross-process mincore functionality from
        unprivileged users, but not to remove it entirely"
      
      We do have an alternate approach that limits the cache residency
      reporting only to processes that have write permissions to the file, so
      we can fix the original information leak issue that way.  It involves
      _adding_ code rather than removing it, which is sad, but hey, at least
      we haven't found any users that would find the restrictions
      unacceptable.
      
      So revert the optimistic first approach to make room for that alternate
      fix instead.
      
      Reported-by: default avatarJosh Snyder <joshs@netflix.com>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Dominique Martinet <asmadeus@codewreck.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Kevin Easton <kevin@guarana.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Cyril Hrubis <chrubis@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Daniel Gruss <daniel@gruss.cc>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      30bac164
  2. Jan 22, 2019
  3. Jan 10, 2019
  4. Jan 09, 2019
  5. Jan 06, 2019
    • Linus Torvalds's avatar
      Change mincore() to count "mapped" pages rather than "cached" pages · 574823bf
      Linus Torvalds authored
      
      The semantics of what "in core" means for the mincore() system call are
      somewhat unclear, but Linux has always (since 2.3.52, which is when
      mincore() was initially done) treated it as "page is available in page
      cache" rather than "page is mapped in the mapping".
      
      The problem with that traditional semantic is that it exposes a lot of
      system cache state that it really probably shouldn't, and that users
      shouldn't really even care about.
      
      So let's try to avoid that information leak by simply changing the
      semantics to be that mincore() counts actual mapped pages, not pages
      that might be cheaply mapped if they were faulted (note the "might be"
      part of the old semantics: being in the cache doesn't actually guarantee
      that you can access them without IO anyway, since things like network
      filesystems may have to revalidate the cache before use).
      
      In many ways the old semantics were somewhat insane even aside from the
      information leak issue.  From the very beginning (and that beginning is
      a long time ago: 2.3.52 was released in March 2000, I think), the code
      had a comment saying
      
        Later we can get more picky about what "in core" means precisely.
      
      and this is that "later".  Admittedly it is much later than is really
      comfortable.
      
      NOTE! This is a real semantic change, and it is for example known to
      change the output of "fincore", since that program literally does a
      mmmap without populating it, and then doing "mincore()" on that mapping
      that doesn't actually have any pages in it.
      
      I'm hoping that nobody actually has any workflow that cares, and the
      info leak is real.
      
      We may have to do something different if it turns out that people have
      valid reasons to want the old semantics, and if we can limit the
      information leak sanely.
      
      Cc: Kevin Easton <kevin@guarana.org>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Masatake YAMATO <yamato@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      574823bf
  6. Jan 04, 2019
  7. Jan 02, 2019
    • Linus Torvalds's avatar
      block: don't use un-ordered __set_current_state(TASK_UNINTERRUPTIBLE) · 1ac5cd49
      Linus Torvalds authored
      
      This mostly reverts commit 849a3700 ("block: avoid ordered task
      state change for polled IO").  It was wrongly claiming that the ordering
      wasn't necessary.  The memory barrier _is_ necessary.
      
      If something is truly polling and not going to sleep, it's the whole
      state setting that is unnecessary, not the memory barrier.  Whenever you
      set your state to a sleeping state, you absolutely need the memory
      barrier.
      
      Note that sometimes the memory barrier can be elsewhere.  For example,
      the ordering might be provided by an external lock, or by setting the
      process state to sleeping before adding yourself to the wait queue list
      that is used for waking up (where the wait queue lock itself will
      guarantee that any wakeup will correctly see the sleeping state).
      
      But none of those cases were true here.
      
      NOTE! Some of the polling paths may indeed be able to drop the state
      setting entirely, at which point the memory barrier also goes away.
      
      (Also note that this doesn't revert the TASK_RUNNING cases: there is no
      race between a wakeup and setting the process state to TASK_RUNNING,
      since the end result doesn't depend on ordering).
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1ac5cd49
  8. Dec 28, 2018
Loading