1. 15 Nov, 2019 11 commits
    • md: historical-service-time path selector · 746e74d8
      Khazhismel Kumykov authored
      Path selector weighing paths based on their historical measured service time.
      Selector keeps an exponential moving average of the service time for each path,
      and uses this along with the number of inflight requests to estimate future
      service time for a path.
      Since we don't have a prober, to account for temporarily slow paths, re-try
      "slow" paths every once in a while (num_paths * historical_service_time)
      To account for fast paths transitioning to slow, if a path has not completed
      any request within (num_paths * historical_service_time), limit the number of
      outstanding requests.
      To account for low volume situations where number of inflights would be
      zero, the last finish time of each path is factored in.
      Can specify a minimum multiplier threshold before similar paths are weighted
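      The estimate described above can be illustrated with a minimal sketch,
      written in Python rather than kernel C. The smoothing factor ALPHA, the
      names PathStats and select_path, and the formula (average service time
      multiplied by inflight + 1) are illustrative assumptions, not taken from
      the patch; the staleness and retry rules are left out.

```python
from dataclasses import dataclass

ALPHA = 0.25  # assumed smoothing factor, not taken from the patch


@dataclass
class PathStats:
    name: str
    avg_service_ns: float = 0.0  # exponential moving average of service time
    inflight: int = 0            # requests dispatched but not yet completed

    def record_completion(self, service_ns: float) -> None:
        """Fold one measured service time into the moving average."""
        if self.avg_service_ns == 0.0:
            self.avg_service_ns = float(service_ns)  # first sample seeds the average
        else:
            self.avg_service_ns += ALPHA * (service_ns - self.avg_service_ns)

    def estimate_ns(self) -> float:
        """Assumed estimate: a new request waits behind the inflight ones."""
        return self.avg_service_ns * (self.inflight + 1)


def select_path(paths):
    """Pick the path with the lowest estimated service time."""
    return min(paths, key=lambda p: p.estimate_ns())


fast = PathStats("fast")
slow = PathStats("slow")
fast.record_completion(1_000)
slow.record_completion(10_000)
fast.inflight = 3  # a few queued requests still beat the slow path
chosen = select_path([fast, slow])
```

      With enough requests queued on the fast path, the same estimate starts to
      prefer the slow but idle path.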
      Upstream-Notes: probably, instead of relying on (internal) io_service_time_ns,
        instead just record start on io_start and finish on io_end. This needs some
        smarts to handle multiple outstanding requests. (io_service_time_ns will be
          set to the time since the last finished request, rather than time since
          io was started)
        Also, won't compile :)))
      This is a manual test that outputs IOPS, latency percentiles, etc.
      comparing service-time and historical-service-time.
      Tested with randomreader.borg on iscsi-test wl-kerneltest for no
      ./dmsetup table
      multipath-2: 0 61440 multipath 3 queue_if_no_path
      queue_if_no_path_timeout_secs 600 0 1 1 historical-service-time 1 999 3 1
      66:192 1 66:208 1 66:224 1
      Google-Bug-Id: 33307640
      Origin-8xx-SHA1: 11ff181d99de7839a51dc481b495a49bb19d391a
      Origin-8xx-SHA1: 08f222a2697ff81c693cdd4e9549aad9d1a519ea
      dm-mpath: More evenly distribute ties for hst.
      Origin-8xx-SHA1: d94d2a6d277710c7a2cbbb644ba680fd8097190e
      dm-mpath, hst: Add minimum threshold for hst.
      This adds a threshold_multiplier parameter for historical-service-time
      to specify a minimum multiple threshold between different paths before
      they are considered different speeds.
      This cuts down on path "stickiness" due to very high variance paths,
      while still performing similarly in the case where one or two paths
      exceed the threshold compared to a non-bad path.
      In the case that two paths are too-similar, we only compare outstanding
      requests on the path.
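      A minimal sketch of the threshold comparison, in Python for brevity. The
      function name pick_path, the tuple layout, and the estimate used once the
      threshold is exceeded are assumptions for illustration; only the rule
      "below the threshold, compare outstanding requests" comes from the text.

```python
def pick_path(path_a, path_b, threshold_multiplier):
    """Each path is a (name, avg_service_ns, inflight) tuple."""
    name_a, avg_a, inflight_a = path_a
    name_b, avg_b, inflight_b = path_b

    faster, slower = sorted((avg_a, avg_b))
    if slower > faster * threshold_multiplier:
        # Speeds differ by more than the threshold: compare estimates.
        # (The estimate formula here is an assumption.)
        est_a = avg_a * (inflight_a + 1)
        est_b = avg_b * (inflight_b + 1)
        return name_a if est_a <= est_b else name_b

    # Paths are too similar: only compare outstanding requests.
    return name_a if inflight_a <= inflight_b else name_b
```

      Raising the multiplier widens the band in which paths count as equal,
      which is what reduces stickiness on high-variance paths.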
      This bumps SRCFS_CL to cl/183153920
      Decreased path stickiness (avg_spree), and improved tail latencies for
      all_clear tests. Slightly degraded tail latencies for some one_bad_delay
      tests due to longer time-to-adjust period, which is an expected trade
      http://sponge/be21984e-07b6-4e9f-9ea3-3de515baefb0 (with cl/176530023)
      Origin-8xx-SHA1: d77cc3e38008c575590bda3ab2234ebcca514b5f
      drop "dm: bump DM_VERSION_MINOR for hst changes."
      Dropped-8xx-SHA1: 8df8c793720235d972a1926c24793850d235df24
      dm-mpath, hst: Only update stale on io_finish
      We should only update path stale time upon io completion, when we have a
      recent/accurate service_time measurement. Previously we were updating
      upon the first io_start for an unused path, with reasoning that a path
      may be stale due to underutilization rather than taking
      longer-than-expected to return.
      For synthetic 2 thread IO on a 3 target device with throttle, we were
      seeing the HST schedule 2 IOs on one path unexpectedly, causing tail
      latency regression, due to treating this stale data as fresh. With
      throttle, all three paths become ~10x slower at the same time, and with
      2 parallel IOs, two paths would have estimated service time updated to
      be 10x worse, leaving the third path with a stale 'good' estimate. HST
      would schedule both on the "fast" path, resulting in two queued
      throttled IOs. Normally, if the "better" path has a stale service_time
      estimate, we fall back to queue-length based scheduling.
      //fs/path_selector manually, fixes threaded_2_fio all_clear_delay
      throttled 99.99th regression for HST compared to ST, and thread_9_fio
      all_clear_delay unthrottled tail regression (this test is noisy,
      http://sponge/b9e69356-070f-406c-a9b8-c592d7c83a7 - before
      http://sponge/bc786d73-26db-4687-a1bf-c70ce95056de - after
      Notes for upstream:
      The timing relies on io_start_time_ns (submitted to driver) and
      io_service_time_ns (which also has some adjustments to deal with
      multiple IOs dispatched to the driver at the same time). We can get
      pretty close to this just within HST, without exposing struct rq as a
      parameter, by duplicating the current block histogram stats in HST. (And
      also support bio based devices, which we don't need.)
      Effort: storage/iscsi
      Origin-8xx-SHA1: b547d652ceb3b7141ae343ee75eb524f7336bac6
      Signed-Off-By: Khazhismel Kumykov <khazhy@google.com>
      Change-Id: I7941df8c855a99e9a477ab416f664f3bbaa99050
    • md: Expose struct request to path selector. · 3aa4bb62
      Khazhismel Kumykov authored
      This is to allow for access to metadata such as request start and
      end time, as well as service time and wait time.
      nr_bytes is retained for end_io as blk_rq_bytes represents the number
      of bytes *left* in a request, and is 0 after a request is finished.
      Upstream-Notes: this seems ugly, and probably not needed if HST just tracks io
        dispatch times by itself...
      Tested: Compiles
      ./dmsetup status - verified idle device had no inflights reported
      Rebase-Tested-9xx: Compiled
      dm-multipath was changed to support bio based targets as well as rq
      based targets, so instead of adding struct rq to existing start/end,
      create a new callback.
      Effort: storage/iscsi
      Google-Bug-Id: 33307640
      Origin-8xx-SHA1: f27a3212c0b609008eba9dd8919e4e20c93e2d1c
      Signed-Off-By: Khazhismel Kumykov <khazhy@google.com>
      Change-Id: I5adbe82b394d737fdc38108f580061a6a38be6e4
    • iscsi: Add support for asynchronous iSCSI session destruction · b12c5f44
      Frank Mayhar authored
      Add a new user event that triggers asynchronous iSCSI session destruction.
      This change will allow operations to take place on other sessions while
      the destroy process is proceeding, removing one of the major bottlenecks.
      Remove session from sesslist when calling DESTROY_SESSION_ASYNC. After
      removal the session can no longer be looked up in netlink, preventing
      operations racing with destroy_work freeing the session.
      Tested: Currently untested, waiting for initiator changes.  This version
      is for early review.
      //fs/iscsi/... @("iscsi: Add support for
      asynchronous iSCSI session destruction")
      Failures expected due to partial rebase
      This commit is needed for tests.
      Google-Bug-Id: 17069624
      Google-Bug-Id: 14494008
      Effort: storage/iscsi
      Origin-8xx-SHA1: c12ea7326c7e2a0752abba61211fceea9971e5dd
      Change-Id: I8721705a3e33e6148ed89020796ab51c93c7dab6
      iscsi: synchronize asynchronous iSCSI session destruction
      Tested: //fs/iscsi/...
      http://sponge/10770df8-609c-4eff-bc65-8babea776cba - SMP
      http://sponge/f9af0ad5-1ffe-4869-aa10-7368e112a6bd - DBG
      Fixes: c12ea7326c7e ("iscsi: Add support for asynchronous iSCSI session destruction")
      Rebase-Tested-9xx: //fs/iscsi/...
      Effort: storage/iscsi
      Google-Bug-Id: 34284815
      Origin-8xx-SHA1: 22e0c4d03d7b0007d330f611994a96bfaa1e8879
      Signed-Off-By: Frank Mayhar <fmayhar@google.com>
      Signed-Off-By: Khazhismel Kumykov <khazhy@google.com>
      Change-Id: I38a977b78bce17ecff50359487b3510fb5f4bee5
    • iscsi: Suppress session recovery messages for 0 timeout · 5456aab8
      Vaibhav Nagarnaik authored
      As part of normal shutdown, session recovery is suppressed. To do that,
      the timeout is set to 0. In this case, any session recovery related
      messages are not required to be printed.
      Tested: Built kernel and checked that session recovery messages don't
      get logged.
      Rebase-Tested-9xx: //fs/iscsi/...
      Effort: storage/iscsi
      Google-Bug-Id: 23757466
      Origin-8xx-SHA1: 6c166c776ef5ea0935b33bdd773d78de2d6c51b3
      Signed-Off-By: Vaibhav Nagarnaik <vnagarnaik@google.com>
      Signed-Off-By: Bharath Ravi <rbharath@google.com>
      Signed-Off-By: Khazhismel Kumykov <khazhy@google.com>
      Change-Id: I2b731fbb393cf6c970b4b4eca0cf107946224841
    • iscsi: do not try to call device_del() if device_add() never succeeded · 90f95ee0
      Tahsin Erdogan authored
      transport_add_device() is declared as void so any errors that occur are
      unknown to the caller. A later call to transport_remove_device() should
      make sure that device_del() is called only if the pairing device_add()
      was successful.
      This patch is only a workaround to the underlying problem. A more
      involved fix would be changing the return type of transport_add_device()
      and make sure the errors are propagated properly. This is currently beyond
      our scope.
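      The workaround pattern can be sketched as follows, in Python rather than
      kernel C. The class TransportDevice, its `added` flag, and the simulated
      failure are hypothetical and stand in for whatever state the patch
      actually tracks; the point is only that removal is skipped when the add
      never succeeded.

```python
class TransportDevice:
    """Toy model: remember whether the add step succeeded."""

    def __init__(self, fail_add=False):
        self._fail_add = fail_add  # simulates an allocation failure during add
        self.added = False         # set only when the add step succeeded
        self.log = []

    def _device_add(self):
        if self._fail_add:
            raise MemoryError("simulated allocation failure")
        self.log.append("device_add")

    def _device_del(self):
        self.log.append("device_del")

    def transport_add_device(self):
        """Returns nothing, like the void function in the commit message."""
        try:
            self._device_add()
            self.added = True
        except MemoryError:
            self.added = False  # the error is swallowed; only the flag records it

    def transport_remove_device(self):
        if self.added:  # skip the delete when the add never succeeded
            self._device_del()
            self.added = False


broken = TransportDevice(fail_add=True)
broken.transport_add_device()
broken.transport_remove_device()  # no device_del call is made
```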
        Injected a memory allocation error as explained in b/32028705. Verified
        that with this fix, it does not cause a kernel crash.
        kokonut //fs/iscsi:iscsi_test_suite:simple_dd
      Upstream-Plan: Reimplement to properly handle transport_add_device
      failure, most likely.
      Effort: storage/iscsi
      Google-Bug-Id: 32028705
      Origin-8xx-SHA1: dc94075b7bc470259a753e5539bb2f7ffe87599f
      Signed-Off-By: Tahsin Erdogan <tahsin@google.com>
      Signed-Off-By: Khazhismel Kumykov <khazhy@google.com>
      Change-Id: Ic2454d804a68927dc88e37bcd4259ad078c6011b
    • dm mpath: Add timeout mechanism for queue_if_no_path. · 05631ccc
      Anatol Pomazau authored
      Add a configurable timeout mechanism to the queue_if_no_path function
      of multipath.  If set and if queue_if_no_path is set, the timeout is
      started when there are no active paths on a multipath device.  The
      timeout is reset if an active path is introduced or, of course, if a
      new table (and therefore a new multipath definition) is loaded.  If
      the timeout ever fires, the handler simply turns queue_if_no_path off.
      This allows I/O queued in multipath to be errored, possibly releasing
      locks and semaphores that may be being held waiting for that I/O to
      This mechanism is not turned on by default (the default timeout is zero).
      It can be turned on by setting
      (sets a timeout for all newly-created multipath instances) or by either
      adding the queue_if_no_path_timeout parameter to the table definition
      or sending the parameter via the DM message mechanism (sets a timeout
      for only that multipath instance; note that this doesn't survive
      a table reload unless the parameter is included in the new table).
      When turned on, this avoids a hang when a multipath device has
      queue_if_no_path set and no viable paths.
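      The timeout behaviour described in this commit can be modelled with a
      small state machine. This is a sketch in Python with an explicit clock
      instead of a kernel timer; the class Multipath and its method names are
      hypothetical.

```python
class Multipath:
    """Toy model of the queue_if_no_path timeout; time is passed in explicitly."""

    def __init__(self, timeout_secs=0):
        self.timeout_secs = timeout_secs  # 0 means the mechanism is off
        self.queue_if_no_path = True
        self.active_paths = 0
        self.deadline = None              # when the timeout fires, if armed

    def _update_timer(self, now):
        no_paths = self.active_paths == 0
        if self.timeout_secs and self.queue_if_no_path and no_paths:
            if self.deadline is None:
                self.deadline = now + self.timeout_secs  # start the timeout
        else:
            self.deadline = None                         # reset the timeout

    def add_path(self, now):
        self.active_paths += 1
        self._update_timer(now)

    def fail_path(self, now):
        self.active_paths = max(0, self.active_paths - 1)
        self._update_timer(now)

    def tick(self, now):
        """Timer handler: on expiry, simply turn queue_if_no_path off."""
        if self.deadline is not None and now >= self.deadline:
            self.queue_if_no_path = False
            self.deadline = None

    def submit_io(self):
        if self.active_paths > 0:
            return "dispatched"
        return "queued" if self.queue_if_no_path else "EIO"


mp = Multipath(timeout_secs=15)
mp.add_path(now=0)
mp.fail_path(now=100)  # last path gone: the timeout starts
mp.tick(now=115)       # timeout fires, queued I/O can now be errored
```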
      Tested: Run by hand against the testcase used to reproduce b/10525771.
      Google-Bug-Id: 8027821
      iscsi:  Enable queue_if_no_path timeout after loading paths, not before.
      If we enable the timeout in table load before loading the paths, it can
      fire even if we have good devices and turn off queue_if_no_path, meaning
      that users can get EIO unexpectedly.  Instead, only enable the timeout
      after loading the paths, if any.
      Also eliminate a potential memory leak in which we could fail to free
      the timer if we had an error during table load.
      Tested: By hand, created a session with a fifteen-second timeout, saw that
      queue_if_no_path stayed set after the timeout elapsed, killed all targets,
      saw that the timeout fired properly and the application got EIO.  Without
      this change, a session with a timeout will have queue_if_no_path reset
      when the timeout fires, whether or not there are valid paths.  This means
      that if giscsid dies and the kernel runs out of targets, EIO will happen
      instantly (as soon as the last target dies), instead of after the timeout.
      Google-Bug-Id: 11249942
      Origin-8xx-SHA1: 5e85400845f66fa1a88f22cfc9cfa5786d2375e9
      dm_mpath:  Restore del_timer_sync() call to multipath_dtr().
      Origin-8xx-SHA1: 3cbdafad2c91c6286622a24d7a8bccbb89fb5c1b
      dm-mpath: DMWARN on queue-if-no-path timeout
      Google-Bug-Id: 27678872
      Origin-8xx-SHA1: ee10da1694a51ad9e6662f7abf816afe7d9d03f3
      The timer_list API has changed where keeping it as a pointer on the
      struct is awkward to use. Re-implement as a normal member. Instead of
      relying on nopath_timer == NULL, just mod/del_timer when
      setup_timer() only needs to be called on ctr. Call mod_timer whenever we
      want to start the timer, and del_timer(_sync) to stop it.
      queue_if_no_path_timeout now passes
      Effort: storage/iscsi
      Signed-Off-By: Frank Mayhar <fmayhar@google.com>
      Signed-Off-By: Bharath Ravi <rbharath@google.com>
      Signed-Off-By: Anatol Pomazau <anatol@google.com>
      Signed-Off-By: Khazhismel Kumykov <khazhy@google.com>
      Change-Id: Iab4e07ba20b2b20044e57c3f535090e392f1af7e
    • iscsi: Perform connection failure entirely in the kernel. · 3b1848c7
      Bharath Ravi authored
      Connection failure processing depends on a daemon being present to (at
      least) stop the connection and start recovery.  This removes that
      dependency by stopping the connection in the kernel and performing
      recovery timeout processing immediately, thereby failing the SCSI
      Upstream-Notes: this introduces some "fun" locking changes
      Tested: Kokonut tests in //fs/iscsi/...
      Google-Bug-Id: 8448740
      Origin-8xx-SHA1: 2ad9cf32dedff585cf0960467165e58529aac6ab
      iscsi: Prevent races between kernel and userspace operations.
      Prevent races between kernel connection failure handling and
      daemon operations. The following patches add a new mutex to prevent
      this but subsequently refactor the code to remove the mutex.
      Tested: Ran //fs/iscsi/... on dbg and non-dbg kernels, no
      issues in either. A couple of tests fail on dbg kernels for unrelated
      dbg: http://sponge/d2353056-540b-4a64-a6a7-1787705171e6
      smp: http://sponge/eb7d4290-b742-43a2-9c56-a91b78e56d22
      Origin-8xx-SHA1: 550f43a59925283a605d9c1e9343c7cb752ff96c
      iscsi: Fix deadlock in iscsi_sw_tcp_release_conn
      Deadlock was caused by lock ordering between session->frwd_lock and
      There are several places where we must grab sk_callback_lock, and then grab
      frwd_lock, due to having only a struct sock and having to traverse up via
       - iscsi_sw_tcp_data_ready
       - iscsi_sw_tcp_state_change
      There are several places where we must grab frwd_lock, and then grab
      sk_callback_lock, due to only having a struct iscsi_conn and having to
      traverse down via iscsi_conn->iscsi_tcp_conn->sock
       - iscsi_sw_tcp_conn_stop
       - iscsi_sw_tcp_conn_destroy
      A previous solution in 7xx was to add a lock to iscsi_tcp_conn protecting
      the sock member. This still does not give us a single locking order,
      although the lockup did seem unlikely.
       - in iscsi_sw_tcp_data_ready/state_change, we never have to grab this
          lock. We would grab sk_callback_lock, frwd_lock
       - in iscsi_sw_tcp_conn_stop/destroy we would have to grab the lock. We
          would grab tcp_sw_conn->lock, sk_callback_lock.
       - in iscsi_sw_tcp_host_get_param we need to grab frwd_lock, and then
      Another solution, which we opted for here, is that in iscsi_sw_tcp_release_conn
      (used by conn_stop/destroy) we grab frwd_lock, get our data member,
      release and regrab in order. Since iscsi_sw_tcp_release_conn must never
      be called simultaneously on the same connection, and it is responsible
      for releasing the connection, we do not have to handle that case. The
      function is called by iscsi_sw_tcp_conn_destroy and
      iscsi_sw_tcp_conn_stop, which are both in all cases called with
      __rx_queue_mutex held.
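      The "release and regrab in order" approach can be sketched with ordinary
      locks. This is an interpretation in Python, not the kernel code: the
      classes Sock and Session and the events list are hypothetical, and the
      lock names only mirror the ones in the text. Both paths end up taking
      sk_callback_lock before frwd_lock, so they cannot deadlock against each
      other.

```python
import threading


class Sock:
    def __init__(self):
        self.sk_callback_lock = threading.Lock()


class Session:
    def __init__(self):
        self.frwd_lock = threading.Lock()
        self.sock = Sock()
        self.events = []


def data_ready(sock, session):
    """Socket callback path: sk_callback_lock first, then frwd_lock."""
    with sock.sk_callback_lock:
        with session.frwd_lock:
            session.events.append("data_ready")


def release_conn(session):
    """Teardown path, using the same lock order as data_ready."""
    with session.frwd_lock:
        sock = session.sock          # read the member, then drop frwd_lock
    if sock is None:
        return
    with sock.sk_callback_lock:      # regrab in the common order
        with session.frwd_lock:
            session.sock = None
            session.events.append("release_conn")
```

      Dropping frwd_lock between the read and the regrab is only safe because,
      as stated above, release is never run concurrently on one connection.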
      Inserted completions into kernel to force the offending areas to
      deadlock, tested with and without fix.
      Google-Bug-Id: 34359890
      Origin-8xx-SHA1: 0b8e182b33bf22ebeb2b9dd91166572476cde5bc
      Origin-8xx-SHA1: 0e2af66c83ac947540c6eeade6f17b79d7edd4ed
      iscsi: eliminate connection failure work queue
      Currently, submitting work to iscsi_conn_failure_workq involves
      allocating a work_struct. This operation is subject to failure in which
      case the connection remains stuck indefinitely.
      This patch eliminates memory allocation by introducing a single global
      work_struct which does the work for all the connections. By consolidating
      all the failure handling in a single work item, the need for having a
      dedicated work queue also goes away.
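      One way to picture the single shared work item is a per-connection flag
      plus one handler that scans all connections. The sketch below is Python
      and purely illustrative: Conn, FailureWork, and the failure_pending flag
      are hypothetical names, and how the patch actually tracks failed
      connections is not stated here.

```python
class Conn:
    def __init__(self, name):
        self.name = name
        self.failure_pending = False  # set on the error path, no allocation


class FailureWork:
    """Stands in for the single global work item shared by all connections."""

    def __init__(self, conns):
        self.conns = conns
        self.scheduled = False
        self.handled = []

    def conn_failure(self, conn):
        # Reporting a failure only flips flags, so it cannot fail itself.
        conn.failure_pending = True
        self.scheduled = True

    def run(self):
        # One pass handles every connection that reported a failure.
        self.scheduled = False
        for conn in self.conns:
            if conn.failure_pending:
                conn.failure_pending = False
                self.handled.append(conn.name)


a, b, c = Conn("a"), Conn("b"), Conn("c")
work = FailureWork([a, b, c])
work.conn_failure(a)
work.conn_failure(c)
work.conn_failure(a)  # a repeated report is absorbed by the flag
work.run()
```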
        kokonut //fs/iscsi/... //fs/xfstests/pdiscsi/...
      Google-Bug-Id: 33495023
      Origin-8xx-SHA1: bca95f98da8bc004c4084aba8e22613b002805bd
      iscsi: suppress __GFP_IO flag while holding rx_queue_mutex lock
      Waiting for pending io requests to complete in dm-bufio shrinker is not safe.
      This is because the task that invokes shrinker may be holding
      rx_queue_mutex and completion of the pending io request may depend on code that
      needs to acquire the named mutex.
        Reproduced the deadlock by arranging the following setup.
        Added a custom entry to kernel that grabs rx_queue_mutex and tries
        to allocate a lot of memory in order to trigger mm shrinking.
        In a modified kokonut test, specified iscsi_test_debug_forever_hang_sector0_read
        flag to the iscsi target app so that it does not respond to the read block
        Once the io is pending, invoked the custom kernel code to start
        These steps caused a deadlock without this patch. Tried the same steps
        with the patch and verified that deadlock does not occur.
        Google3 and kernel patches for the repro are saved here:
        kokonut //fs/iscsi/... //fs/xfstests/pdiscsi/... //mm/...
      Origin-8xx-SHA1: 9556060ce3dab0d54dc5fa5c0bdc99714c30fd5b
      Google-Bug-Id: 28317742
      //fs/iscsi/... @("iscsi: Add support for
      asynchronous iSCSI session destruction")
      Effort: storage/iscsi
      Signed-Off-By: Dave Clausen <dclausen@google.com>
      Signed-Off-By: Nick Black <nlb@google.com>
      Signed-Off-By: Vaibhav Nagarnaik <vnagarnaik@google.com>
      Signed-Off-By: Anatol Pomazau <anatol@google.com>
      Signed-Off-By: Tahsin Erdogan <tahsin@google.com>
      Signed-Off-By: Frank Mayhar <fmayhar@google.com>
      Signed-Off-By: Junho Ryu <jayr@google.com>
      Signed-Off-By: Bharath Ravi <rbharath@google.com>
      Signed-Off-By: Khazhismel Kumykov <khazhy@google.com>
      Change-Id: I30a12b54ff1f1db0e71fc03034c4f9935781781a
    • iscsi: Add conn_err connection error status; export via sysfs. · dbe90595
      Junho Ryu authored
      If an iSCSI connection happens to fail while the daemon (e.g. open-iscsi)
      isn't running (due to a crash or for another reason), the kernel failure
      report is dropped.  There is insufficient kernel state in sysfs when the
      daemon restarts for it to know that this happened.  The connection remains
      hung and the only way to recover is to create a new connection to the
      target, leaving the old block device dangling.
      This change adds a new field to iscsi_cls_conn, conn_err, that indicates
      that the connection had an error.  It's set in iscsi_conn_error_event()
      and reset when a connection-related uevent arrives.
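      A minimal sketch of the flag's lifecycle, in Python. The class ClsConn,
      the daemon_running switch, and the "0"/"1" text returned by the sysfs
      reader are assumptions for illustration; only the set-on-error and
      reset-on-uevent behaviour comes from the description above.

```python
class ClsConn:
    """Toy model of a connection carrying a sticky error flag."""

    def __init__(self):
        self.conn_err = False
        self.daemon_running = True
        self.delivered = []  # error reports the daemon actually received

    def conn_error_event(self, err):
        self.conn_err = True  # state survives even if the report is dropped
        if self.daemon_running:
            self.delivered.append(err)

    def handle_uevent(self, event):
        self.conn_err = False  # a connection-related uevent clears the flag

    def sysfs_show_conn_err(self):
        return "1" if self.conn_err else "0"  # output format is an assumption


conn = ClsConn()
conn.daemon_running = False
conn.conn_error_event("conn failed")  # report dropped, flag still set
state_after_restart = conn.sysfs_show_conn_err()
```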
      Upstream-Notes: this may get some push-back without userspace support
      Google-Bug-Id: 8531674
      Tested: Kokonut tests in //fs/iscsi/...
      //fs/iscsi/... @("iscsi: Add support for
      asynchronous iSCSI session destruction")
      Effort: storage/iscsi
      Origin-8xx-SHA1: e05236e9d2ad4faec6981c8dc7bf87a07cd26d04
      Signed-Off-By: Frank Mayhar <fmayhar@google.com>
      Signed-Off-By: Bharath Ravi <rbharath@google.com>
      Signed-Off-By: Junho Ryu <jayr@google.com>
      Signed-Off-By: Khazhismel Kumykov <khazhy@google.com>
      Change-Id: Ia0d46435a93fea2aeb302872b990c97ed7861116
    • iscsi: do not wait for ios in dm shrinker · 996def3b
      Tahsin Erdogan authored
      If userspace iscsi daemon destroys a failed session with pending IOs, there is
      a potential deadlock with dm shrinker (invoked with __GFP_IO) - the daemon
      requests an allocation, which calls the shrinker, which waits for an iSCSI IO
      to complete, which is waiting on the iscsi daemon for recovery.
      Fixes: dc60f51f075 ("iscsi: Perform connection failure entirely in the kernel.")
      	Modified kernel to call dm shrinker during a dm_suspend()
      	call similar to the call stack above.
      	Deadlock repro consists of 3 giscsi sessions. We start three
      	iscsi targets, one for each session.
      	Session1's target is configured to not respond to read requests.
      	This simulates a stuck io that does not complete. Then we run:
      		dd if=/dev/mapper/verity-1 of=/dev/null bs=512 count=1
      	Session2 starts with a single target, then we add another target
      	so that it reconfigures its dm mapping and so ends up calling
      	dev_suspend. Modified kernel recognizes the thread that calls
      	kobject_uevent_env() from dev_suspend and invokes dm shrinker.
      	Shrinker blocks on the io that is stuck from session1.
      	Then we add a second target to session3. This causes session3 to
      	add a new device and call kobject_uevent_env(). This call blocks
      	because session2's thread has uevent_sock_mutex locked.
      	Finally, a timer fires and kernel decides that connection is
      	bad. iscsi_conn_failure thread tries to acquire __rx_queue_mutex
      	but it can't because the mutex is locked by session3 thread.
      	The setup successfully locked up a system without this patch.
      	After applying the patch, system was able to fail the connection
      Google-Bug-Id: 23816714
      Effort: storage/iscsi
      Origin-8xx-SHA1: f3aab6acba07fe35461604809ee1bb3d2750ba3a
      Signed-Off-By: Khazhismel Kumykov <khazhy@google.com>
      Signed-Off-By: Tahsin Erdogan <tahsin@google.com>
      Change-Id: Ieed6ea04db5fdeacbdca88223b82e64a88534386
    • iscsi: Don't crash if the daemon didn't bind the connection. · 1651298e
      Anatol Pomazau authored
      If the iSCSI daemon fails to bind the socket to the iSCSI connection,
      a subsequent send_pdu will crash the kernel due to a null pointer
      dereference in sock_sendmsg().  Avoid this by checking for a null
      socket pointer in iscsi_sw_tcp_pdu_xmit().
      Tested: By hand.  Ran problematic test, no crash.  Ran regular iSCSI
      test, worked fine.
      //fs/iscsi/... @("iscsi: Add support for
      asynchronous iSCSI session destruction")
      Google-Bug-Id: 10001974
      Effort: storage/iscsi
      Origin-8xx-SHA1: 37b46182a124b1837eeb63790d3f8dd28828ddda
      iscsi: Prevent races between kernel and userspace operations.
      Related-8xx-SHA1: 550f43a59925283a605d9c1e9343c7cb752ff96c
      Signed-Off-By: Frank Mayhar <fmayhar@google.com>
      Signed-Off-By: Bharath Ravi <rbharath@google.com>
      Signed-Off-By: Anatol Pomazau <anatol@google.com>
      Signed-Off-By: Khazhismel Kumykov <khazhy@google.com>
      Change-Id: Id125d97d1fc08a98e9d03735fea36c5a0d3cbcb2
    • iscsi: Don't destroy session if there are conns · a100bca2
      Nick Black authored
      Previous to this patch, destroying sessions without destroying all
      the connections resulted in the kernel corrupting slab.  This patch
      fixes that without introducing memory leaks by making sure that
      if a request to destroy the session is made, the connections are
      already destroyed.
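      The guard can be sketched as a destroy call that refuses to proceed while
      connections remain. This is Python and an interpretation of the commit
      title; the class IscsiSession and the boolean return value are
      hypothetical, and the error the kernel actually reports is not stated
      here.

```python
class IscsiSession:
    """Toy model: a session may only be destroyed once it has no connections."""

    def __init__(self):
        self.conns = set()
        self.destroyed = False

    def create_conn(self, cid):
        self.conns.add(cid)

    def destroy_conn(self, cid):
        self.conns.discard(cid)

    def destroy(self):
        if self.conns:
            return False  # refuse: connections still reference the session
        self.destroyed = True
        return True


session = IscsiSession()
session.create_conn(0)
first_try = session.destroy()   # refused while connection 0 exists
session.destroy_conn(0)
second_try = session.destroy()  # succeeds once the connection is gone
```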
      Google-Bug-Id: 7386923
      Tested: Kokonut tests in //fs/iscsi/...
      //fs/iscsi/... @("iscsi: Add support for
      asynchronous iSCSI session destruction")
      Origin-8xx-SHA1: 43cadbd4fe83fd10333e06e0b350c43da663432e
      Effort: storage/iscsi
      Signed-Off-By: Salman Qazi <sqazi@google.com>
      Signed-Off-By: Junho Ryu <jayr@google.com>
      Signed-Off-By: Nick Black <nlb@google.com>
      Signed-Off-By: Khazhismel Kumykov <khazhy@google.com>
      Change-Id: Ifa4505d16a333a0b729988f5c5abb77ba1d3bc7b
  2. 28 Jan, 2018 8 commits
  3. 27 Jan, 2018 2 commits
    • hrtimer: Reset hrtimer cpu base proper on CPU hotplug · d5421ea4
      Thomas Gleixner authored
      The hrtimer interrupt code contains a hang detection and mitigation
      mechanism, which prevents that a long delayed hrtimer interrupt causes a
      continuous retriggering of interrupts which prevent the system from making
      progress. If a hang is detected then the timer hardware is programmed with
      a certain delay into the future and a flag is set in the hrtimer cpu base
      which prevents newly enqueued timers from reprogramming the timer hardware
      prior to the chosen delay. The subsequent hrtimer interrupt after the delay
      clears the flag and resumes normal operation.
      If such a hang happens in the last hrtimer interrupt before a CPU is
      unplugged then the hang_detected flag is set and stays that way when the
      CPU is plugged in again. At that point the timer hardware is not armed and
      it cannot be armed because the hang_detected flag is still active, so
      nothing clears that flag. As a consequence the CPU does not receive hrtimer
      interrupts and no timers expire on that CPU which results in RCU stalls and
      other malfunctions.
      Clear the flag along with some other less critical members of the hrtimer
      cpu base to ensure starting from a clean state when a CPU is plugged in.
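      The failure mode and the fix can be reproduced in a toy model. The sketch
      below is Python, not the kernel code: the class HrtimerCpuBase, its
      methods, and the reset_base switch are hypothetical and only mirror the
      hang_detected flag described above.

```python
class HrtimerCpuBase:
    """Toy model of the per-CPU state relevant to the hang_detected flag."""

    def __init__(self):
        self.hang_detected = False
        self.hw_armed = False

    def enqueue_timer(self):
        """Return True if the timer hardware was (re)programmed."""
        if self.hang_detected:
            return False  # reprogramming is blocked until the delay passes
        self.hw_armed = True
        return True

    def interrupt_detects_hang(self):
        self.hang_detected = True
        self.hw_armed = True  # hardware armed for the delayed interrupt

    def delayed_interrupt(self):
        self.hang_detected = False  # normal operation resumes

    def cpu_down(self):
        self.hw_armed = False  # the delayed interrupt never arrives

    def cpu_up(self, reset_base):
        if reset_base:  # the fix: start from a clean state on plug-in
            self.hang_detected = False


buggy = HrtimerCpuBase()
buggy.interrupt_detects_hang()
buggy.cpu_down()
buggy.cpu_up(reset_base=False)
stuck = not buggy.enqueue_timer()  # stale flag: no timer can arm the hardware

fixed = HrtimerCpuBase()
fixed.interrupt_detects_hang()
fixed.cpu_down()
fixed.cpu_up(reset_base=True)
recovered = fixed.enqueue_timer()
```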
      Thanks to Paul, Sebastian and Anna-Maria for their help to get down to the
      root cause of that hard to reproduce heisenbug. Once understood it's
      trivial and certainly justifies a brown paperbag.
      Fixes: 41d2e494 ("hrtimer: Tune hrtimer_interrupt hang logic")
      Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sebastian Sewior <bigeasy@linutronix.de>
      Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1801261447590.2067@nanos
    • x86: Mark hpa as a "Designated Reviewer" for the time being · 8a95b74d
      H. Peter Anvin authored
      Due to some unfortunate events, I have not been directly involved in
      the x86 kernel patch flow for a while now.  I have also not been able
      to ramp back up by now like I had hoped to, and after reviewing what I
      will need to work on both internally at Intel and elsewhere in the near
      term, it is clear that I am not going to be able to ramp back up until
      late 2018 at the very earliest.
      It is not acceptable to not recognize that this load is currently
      taken by Ingo and Thomas without my direct participation, so I mark
      myself as R: (designated reviewer) rather than M: (maintainer) until
      further notice.  This is in fact recognizing the de facto situation
      for the past few years.
      I have obviously no intention of going away, and I will do everything
      within my power to improve Linux on x86 and x86 for Linux.  This,
      however, puts credit where it is due and reflects a change of focus.
      This patch also removes stale entries for portions of the x86
      architecture which have not been maintained separately from arch/x86
      for a long time.  If there is a reason to re-introduce them then that
      can happen later.
      Signed-off-by: H. Peter Anvin <h.peter.anvin@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Bruce Schlobohm <bruce.schlobohm@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20180125195934.5253-1-hpa@zytor.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  4. 26 Jan, 2018 13 commits
  5. 25 Jan, 2018 6 commits
    • drm/nouveau: Move irq setup/teardown to pci ctor/dtor · 0fd189a9
      Lyude Paul authored
      For a while we've been having issues with seemingly random interrupts
      coming from nvidia cards when resuming them. Originally the fix for this
      was thought to be just re-arming the MSI interrupt registers right after
      re-allocating our IRQs, however it seems a lot of what we do is both
      wrong and not even necessary.
      This was made apparent by what appeared to be a regression in the
      mainline kernel that started introducing suspend/resume issues for
              a0c9259d (irq/matrix: Spread interrupts on allocation)
      After this commit was introduced, we started getting interrupts from the
      GPU before we actually re-allocated our own IRQ (see references below)
      and assigned the IRQ handler. Investigating this turned out that the
      problem was not with the commit, but the fact that nouveau even
      frees/allocates its IRQs before and after suspend/resume.
      For starters: drivers in the linux kernel haven't had to handle
      freeing/re-allocating their IRQs during suspend/resume cycles for quite
      a while now. Nouveau seems to be one of the few drivers left that still
      does this, despite the fact there's no reason we actually need to since
      disabling interrupts from the device side should be enough, as the
      kernel is already smart enough to know to disable host-side interrupts
      for us before going into suspend. Since we were tearing down our IRQs by
      hand however, that means there was a short period during resume where
      interrupts could be received before we re-allocated our IRQ which would
      lead to us getting an unhandled IRQ. Since we never handle said IRQ and
      re-arm the interrupt registers, this would cause us to miss all of the
      interrupts from the GPU and cause our init process to start timing out
      on anything requiring interrupts.
      So, since this whole setup/teardown every suspend/resume cycle is
      useless anyway, move irq setup/teardown into the pci subdev's ctor/dtor
      functions instead so they're only called at driver load and driver
      unload. This should fix most of the issues with pending interrupts on
      resume, along with getting suspend/resume for nouveau to work again.
      As well, this probably means we can also just remove the msi rearm call
      inside nvkm_pci_init(). But since our main focus here is to fix
      suspend/resume before 4.15, we'll save that for a later patch.
      Signed-off-by: Lyude Paul <lyude@redhat.com>
      Cc: Karol Herbst <kherbst@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
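
The lifecycle change described in the nouveau commit above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `model_dev` and every `model_*` function are hypothetical names invented for illustration, not nouveau or kernel APIs. The handler is installed once in the constructor and removed once in the destructor, while suspend/resume only toggle the device-side interrupt enable, so an interrupt that arrives early during resume still finds a handler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a device; none of these names exist in nouveau. */
struct model_dev {
    bool handler_installed; /* host side: an IRQ handler is registered */
    bool dev_irq_enabled;   /* device side: the device may raise interrupts */
    int  unhandled;         /* interrupts that arrived with no handler */
};

/* Constructor: runs once at driver load and installs the handler. */
void model_ctor(struct model_dev *d)
{
    assert(d != NULL);
    d->handler_installed = true;
    d->dev_irq_enabled = false;
    d->unhandled = 0;
}

/* Destructor: runs once at driver unload and removes the handler. */
void model_dtor(struct model_dev *d)
{
    assert(d != NULL);
    d->handler_installed = false;
}

/* Suspend and resume only touch the device-side enable bit. */
void model_suspend(struct model_dev *d) { d->dev_irq_enabled = false; }
void model_resume(struct model_dev *d)  { d->dev_irq_enabled = true; }

/* Deliver one interrupt; returns true if a handler was there to take it. */
bool model_irq(struct model_dev *d)
{
    if (!d->handler_installed) {
        d->unhandled++;
        return false;
    }
    return true;
}
```

In this model an interrupt delivered between `model_suspend()` and `model_resume()` is still handled, because the handler's lifetime is tied to load/unload rather than to the suspend cycle.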
    • Nicolas Dichtel's avatar
      net: don't call update_pmtu unconditionally · f15ca723
      Nicolas Dichtel authored
      Some dst_ops (e.g. md_dst_ops) don't set this handler. Calling it may result in:
      "BUG: unable to handle kernel NULL pointer dereference at           (null)"
      Let's add a helper to check if update_pmtu is available before calling it.
      Fixes: 52a589d5 ("geneve: update skb dst pmtu on tx path")
      Fixes: a93bf0ff ("vxlan: update skb dst pmtu on tx path")
      CC: Roman Kapl <code@rkapl.cz>
      CC: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
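
The "check before calling" helper described in the commit above can be sketched in self-contained C. This is a simplified model, not the kernel's code: the `*_model` structs, `update_pmtu_checked()` and `record_mtu()` are hypothetical names, and the handler signature is reduced to two arguments for illustration. The point is only that the function pointer is tested for NULL before it is called.

```c
#include <assert.h>
#include <stddef.h>

struct sk_buff_model;

/* Simplified stand-in for dst_ops: the handler is optional. */
struct dst_ops_model {
    void (*update_pmtu)(struct sk_buff_model *skb, unsigned int mtu);
};

struct dst_entry_model {
    const struct dst_ops_model *ops;
};

struct sk_buff_model {
    struct dst_entry_model *dst;
    unsigned int seen_mtu; /* set by the example handler below */
};

/* Example handler: records the MTU it was given. */
void record_mtu(struct sk_buff_model *skb, unsigned int mtu)
{
    skb->seen_mtu = mtu;
}

/* Helper: call update_pmtu only if the dst and its ops provide one. */
void update_pmtu_checked(struct sk_buff_model *skb, unsigned int mtu)
{
    assert(skb != NULL);
    struct dst_entry_model *dst = skb->dst;

    if (dst && dst->ops && dst->ops->update_pmtu)
        dst->ops->update_pmtu(skb, mtu);
}
```

With a helper of this shape, callers such as tunnel transmit paths no longer need to know whether a particular dst type implements the handler.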
    • Linus Torvalds's avatar
      Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 6e20630e
      Linus Torvalds authored
      Pull KVM fixes from Radim Krčmář:
       "Fix races and a potential use after free in the s390 cmma migration
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
        KVM: s390: add proper locking for CMMA migration bitmap
    • Linus Torvalds's avatar
      Merge tag 'for-4.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · 525273fb
      Linus Torvalds authored
      Pull btrfs fix from David Sterba:
       "It's been reported recently that readdir can list stale entries under
        some conditions. Fix it."
      * tag 'for-4.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
        Btrfs: fix stale entries in readdir
    • Dan Streetman's avatar
      net: tcp: close sock if net namespace is exiting · 4ee806d5
      Dan Streetman authored
      When a tcp socket is closed, if it detects that its net namespace is
      exiting, close immediately and do not wait for FIN sequence.
      For normal sockets, a reference is taken to their net namespace, so it will
      never exit while the socket is open.  However, kernel sockets do not take a
      reference to their net namespace, so it may begin exiting while the kernel
      socket is still open.  In this case if the kernel socket is a tcp socket,
      it will stay open trying to complete its close sequence.  The sock's dst(s)
      hold a reference to their interface, which are all transferred to the
      namespace's loopback interface when the real interfaces are taken down.
      When the namespace tries to take down its loopback interface, it hangs
      waiting for all references to the loopback interface to release, which
      results in messages like:
      unregister_netdevice: waiting for lo to become free. Usage count = 1
      These messages continue until the socket finally times out and closes.
      Since the net namespace cleanup holds the net_mutex while calling its
      registered pernet callbacks, any new net namespace initialization is
      blocked until the current net namespace finishes exiting.
      After this change, the tcp socket notices the exiting net namespace, and
      closes immediately, releasing its dst(s) and their reference to the
      loopback interface, which lets the net namespace continue exiting.
      Link: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=97811
      Signed-off-by: Dan Streetman <ddstreet@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
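
The close-time decision described in the commit above can be modelled in a few lines of C. This is a sketch under assumptions, not the kernel implementation: `netns_model`, `netns_alive()` and `tcp_close_policy()` are hypothetical names, and treating a reference count of zero as "the namespace is exiting" is an assumption made for this model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a net namespace: refcount 0 means it is exiting. */
struct netns_model {
    int refcount;
};

enum close_action {
    CLOSE_WAIT_FOR_FIN, /* normal close: run the full FIN sequence */
    CLOSE_IMMEDIATELY   /* namespace is exiting: drop the socket now */
};

/* True while something still holds a reference to the namespace. */
bool netns_alive(const struct netns_model *ns)
{
    assert(ns != NULL);
    return ns->refcount != 0;
}

/* Decide how a TCP socket in namespace `ns` should be closed. */
enum close_action tcp_close_policy(const struct netns_model *ns)
{
    return netns_alive(ns) ? CLOSE_WAIT_FOR_FIN : CLOSE_IMMEDIATELY;
}
```

Closing immediately releases the socket's dst references, which is what lets the loopback interface, and then the namespace, finish tearing down.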
    • Peter Zijlstra's avatar
      perf/x86: Fix perf,x86,cpuhp deadlock · efe951d3
      Peter Zijlstra authored
      More lockdep gifts, a 5-way lockup race:
       #0		    mutex_lock(&pmc_reserve_mutex);
       #1		      get_online_cpus()
       #0		mutex_lock(&pmc_reserve_mutex)
       #1		  get_online_cpus()
       #1	do_cpu_up()
       #2	    mutex_lock(&pmus_lock)
       #3	    mutex_lock(&ctx->mutex)
       #3	    mutex_lock(ctx->mutex)
       #4	    mutex_lock_nested(ctx->mutex, 1);
       #4	  mutex_lock_nested(ctx->mutex, 1)
       #0		mutex_lock(&pmc_reserve_mutex)
      Fix it by using ordering constructs instead of locking.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>