1. 09 Sep, 2017 2 commits
    • mm/zsmalloc.c: change stat type parameter to int · 3eb95fea
      Matthias Kaehlcke authored
      zs_stat_inc/dec/get() use enum zs_stat_type for the stat type; however,
      some callers pass an enum fullness_group value.  Change the type to int
      to reflect the actual use of the functions and to get rid of
      'enum-conversion' warnings.
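
      A minimal sketch of the warning being silenced (the enum values and the
      stats layout are simplified stand-ins, not the real mm/zsmalloc.c
      definitions):

       enum zs_stat_type   { OBJ_ALLOCATED, OBJ_USED, NR_ZS_STAT_TYPE };
       enum fullness_group { ZS_ALMOST_FULL, ZS_ALMOST_EMPTY, ZS_FULL };

       struct size_class { unsigned long stats[8]; }; /* sized loosely */

       /* Before: the second parameter was 'enum zs_stat_type type', so a
        * call like zs_stat_inc(class, ZS_ALMOST_FULL, 1) mixed enum types
        * and triggered -Wenum-conversion.  A plain int accepts both. */
       static void zs_stat_inc(struct size_class *class, int type,
                               unsigned long cnt)
       {
               class->stats[type] += cnt;
       }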
      
      Link: http://lkml.kernel.org/r/20170731175000.56538-1-mka@chromium.org
      Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Doug Anderson <dianders@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY · 2916ecc0
      Jérôme Glisse authored
      Introduce a new migration mode that allows offloading the copy to a
      device DMA engine.  This changes the migration workflow, and not all
      address_space migratepage callbacks can support it.

      It is intended to be used by migrate_vma(), which is itself used for
      things like HMM (see include/linux/hmm.h).
      
      No additional per-filesystem migratepage testing is needed: this patch
      disables MIGRATE_SYNC_NO_COPY in all problematic migratepage() callbacks
      and adds a comment to each explaining why.  Any callback that wishes to
      support the new mode needs to be aware of how its migration flow differs
      from that of the other modes.
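
      The shape of the change in those problematic callbacks is roughly the
      following (a paraphrase with a made-up callback name, not a literal
      hunk from the patch):

       static int example_migratepage(struct address_space *mapping,
                                      struct page *newpage, struct page *page,
                                      enum migrate_mode mode)
       {
               /*
                * The extra locking taken around the copy here cannot be
                * held across a whole batch of pages, so refuse the no-copy
                * mode; the caller can fall back to a copying mode.
                */
               if (mode == MIGRATE_SYNC_NO_COPY)
                       return -EINVAL;

               /* ... the callback's usual lock, copy and unlock steps ... */
               return MIGRATEPAGE_SUCCESS;
       }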
      
      Some of these callbacks do extra locking while copying (aio, zsmalloc,
      balloon, ...), and for DMA to be effective you want to copy multiple
      pages in one DMA operation.  But in the problematic cases you cannot
      easily hold the extra lock across multiple calls to this callback.
      
      The usual flow is:
      
      For each page {
       1 - lock page
       2 - call migratepage() callback
       3 - (extra locking in some migratepage() callback)
       4 - migrate page state (freeze refcount, update page cache, buffer
           head, ...)
       5 - copy page
       6 - (unlock any extra lock of migratepage() callback)
       7 - return from migratepage() callback
       8 - unlock page
      }
      
      The new mode MIGRATE_SYNC_NO_COPY:
       1 - lock multiple pages
      For each page {
       2 - call migratepage() callback
       3 - abort in all problematic migratepage() callback
       4 - migrate page state (freeze refcount, update page cache, buffer
           head, ...)
      } // finished all calls to migratepage() callback
       5 - DMA copy multiple pages
       6 - unlock all the pages
      
      To support MIGRATE_SYNC_NO_COPY in the problematic cases we would need
      a new callback, migratepages() for instance, that deals with multiple
      pages in one transaction.
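
      A hypothetical signature for such a batched callback might look like
      this (an illustration only -- the commit deliberately does not add it,
      and the parameter list is an assumption):

       int (*migratepages)(struct address_space *mapping,
                           struct page **newpages, struct page **pages,
                           int nr_pages, enum migrate_mode mode);

      A filesystem could then take its extra lock once around the whole batch
      instead of once per page.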
      
      Because the problematic cases are not important for current usage, I
      did not want to complicate this patchset even more for no good reason.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-14-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 07 Sep, 2017 1 commit
    • zsmalloc: zs_page_migrate: skip unnecessary loops but not return -EBUSY if zspage is not inuse · 77ff4657
      Hui Zhu authored
      Getting -EBUSY from zs_page_migrate() will make migration slow (retry)
      or fail (zs_page_putback() will schedule free_work via schedule_work(),
      but that cannot guarantee success).
      
      I noticed this issue because my kernel is patched with
      https://lkml.org/lkml/2014/5/28/113, which removes the retry in
      __alloc_contig_migrate_range().

      That retry handles the -EBUSY by re-isolating the page and calling
      migrate_pages() again.  Without it, cma_alloc() fails at once with
      -EBUSY.
      
      Following the review from Minchan Kim in
      https://lkml.org/lkml/2014/5/28/113, I updated the patch to skip the
      unnecessary loops, but not to return -EBUSY, if the zspage is not in
      use.
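
      Paraphrased, the idea in zs_page_migrate() is (not the literal diff):

       /*
        * If the zspage holds no live objects there is nothing to relocate:
        * make the per-object loops below no-ops and let the migration
        * succeed instead of bouncing the caller with -EBUSY.
        */
       if (!get_zspage_inuse(zspage))
               offset = PAGE_SIZE;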
      
      The following is what I got with highalloc-performance in a vbox with
      2 CPUs, 1G memory, and 512 zram as swap, with swappiness set to 100.
      
                                        orig         new
      Minor Faults                  50805113    50830235
      Major Faults                     43918       56530
      Swap Ins                         42087       55680
      Swap Outs                        89718      104700
      Allocation stalls                    0           0
      DMA allocs                       57787       52364
      DMA32 allocs                  47964599    48043563
      Normal allocs                        0           0
      Movable allocs                       0           0
      Direct pages scanned             45493       23167
      Kswapd pages scanned           1565222     1725078
      Kswapd pages reclaimed         1342222     1503037
      Direct pages reclaimed           45615       25186
      Kswapd efficiency                  85%         87%
      Kswapd velocity               1897.101    1949.042
      Direct efficiency                 100%        108%
      Direct velocity                 55.139      26.175
      Percentage direct scans             2%          1%
      Zone normal velocity          1952.240    1975.217
      Zone dma32 velocity              0.000       0.000
      Zone dma velocity                0.000       0.000
      Page writes by reclaim       89764.000  105233.000
      Page writes file                    46         533
      Page writes anon                 89718      104700
      Page reclaim immediate           21457        3699
      Sector Reads                   3259688     3441368
      Sector Writes                  3667252     3754836
      Page rescued immediate               0           0
      Slabs scanned                  1042872     1160855
      Direct inode steals               8042       10089
      Kswapd inode steals              54295       29170
      Kswapd skipped wait                  0           0
      THP fault alloc                    175         154
      THP collapse alloc                 226         289
      THP splits                           0           0
      THP fault fallback                  11          14
      THP collapse fail                    3           2
      Compaction stalls                  536         646
      Compaction success                 322         358
      Compaction failures                214         288
      Page migrate success            119608      111063
      Page migrate failure              2723        2593
      Compaction pages isolated       250179      232652
      Compaction migrate scanned     9131832     9942306
      Compaction free scanned        2093272     2613998
      Compaction cost                    192         189
      NUMA alloc hit                47124555    47193990
      NUMA alloc miss                      0           0
      NUMA interleave hit                  0           0
      NUMA alloc local              47124555    47193990
      NUMA base PTE updates                0           0
      NUMA huge PMD updates                0           0
      NUMA page range updates              0           0
      NUMA hint faults                     0           0
      NUMA hint local faults               0           0
      NUMA hint local percent            100         100
      NUMA pages migrated                  0           0
      AutoNUMA cost                       0%          0%
      
      [akpm@linux-foundation.org: remove newline, per Minchan]
      Link: http://lkml.kernel.org/r/1500889535-19648-1-git-send-email-zhuhui@xiaomi.com
      Signed-off-by: Hui Zhu <zhuhui@xiaomi.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 02 Aug, 2017 1 commit
    • zram: do not free pool->size_class · 3189c820
      Minchan Kim authored
      Mike reported that the kernel oopses with the ltp:zram03 testcase:
      
        zram: Added device: zram0
        zram0: detected capacity change from 0 to 107374182400
        BUG: unable to handle kernel paging request at 0000306d61727a77
        IP: zs_map_object+0xb9/0x260
        PGD 0
        P4D 0
        Oops: 0000 [#1] SMP
        Dumping ftrace buffer:
           (ftrace buffer empty)
        Modules linked in: zram(E) xfs(E) libcrc32c(E) btrfs(E) xor(E) raid6_pq(E) loop(E) ebtable_filter(E) ebtables(E) ip6table_filter(E) ip6_tables(E) iptable_filter(E) ip_tables(E) x_tables(E) af_packet(E) br_netfilter(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) nls_iso8859_1(E) nls_cp437(E) vfat(E) fat(E) intel_powerclamp(E) coretemp(E) cdc_ether(E) kvm_intel(E) usbnet(E) mii(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) iTCO_wdt(E) ghash_clmulni_intel(E) bnx2(E) iTCO_vendor_support(E) pcbc(E) ioatdma(E) ipmi_ssif(E) aesni_intel(E) i5500_temp(E) i2c_i801(E) aes_x86_64(E) lpc_ich(E) shpchp(E) mfd_core(E) crypto_simd(E) i7core_edac(E) dca(E) glue_helper(E) cryptd(E) ipmi_si(E) button(E) acpi_cpufreq(E) ipmi_devintf(E) pcspkr(E) ipmi_msghandler(E)
         nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E) ext4(E) crc16(E) mbcache(E) jbd2(E) sd_mod(E) ata_generic(E) i2c_algo_bit(E) ata_piix(E) drm_kms_helper(E) ahci(E) syscopyarea(E) sysfillrect(E) libahci(E) sysimgblt(E) fb_sys_fops(E) uhci_hcd(E) ehci_pci(E) ttm(E) ehci_hcd(E) libata(E) drm(E) megaraid_sas(E) usbcore(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) efivarfs(E) autofs4(E) [last unloaded: zram]
        CPU: 6 PID: 12356 Comm: swapon Tainted: G            E   4.13.0.g87b2c3fc-default #194
        Hardware name: IBM System x3550 M3 -[7944K3G]-/69Y5698     , BIOS -[D6E150AUS-1.10]- 12/15/2010
        task: ffff880158d2c4c0 task.stack: ffffc90001680000
        RIP: 0010:zs_map_object+0xb9/0x260
        Call Trace:
         zram_bvec_rw.isra.26+0xe8/0x780 [zram]
         zram_rw_page+0x6e/0xa0 [zram]
         bdev_read_page+0x81/0xb0
         do_mpage_readpage+0x51a/0x710
         mpage_readpages+0x122/0x1a0
         blkdev_readpages+0x1d/0x20
         __do_page_cache_readahead+0x1b2/0x270
         ondemand_readahead+0x180/0x2c0
         page_cache_sync_readahead+0x31/0x50
         generic_file_read_iter+0x7e7/0xaf0
         blkdev_read_iter+0x37/0x40
         __vfs_read+0xce/0x140
         vfs_read+0x9e/0x150
         SyS_read+0x46/0xa0
         entry_SYSCALL_64_fastpath+0x1a/0xa5
        Code: 81 e6 00 c0 3f 00 81 fe 00 00 16 00 0f 85 9f 01 00 00 0f b7 13 65 ff 05 5e 07 dc 7e 66 c1 ea 02 81 e2 ff 01 00 00 49 8b 54 d4 08 <8b> 4a 48 41 0f af ce 81 e1 ff 0f 00 00 41 89 c9 48 c7 c3 a0 70
        RIP: zs_map_object+0xb9/0x260 RSP: ffffc90001683988
        CR2: 0000306d61727a77
      
      He bisected the problem to commit cf8e0fed ("mm/zsmalloc: simplify
      zs_max_alloc_size handling").

      Since that commit, zram doesn't use a double pointer for
      pool->size_class in zs_create_pool() any more, so the counterpart
      function zs_destroy_pool() doesn't need to free it either.

      Otherwise, it kfree()s a wrong address and the kernel oopses.
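
      Paraphrased, the leftover looked roughly like this (layout abbreviated;
      after cf8e0fed pool->size_class is an array embedded in struct zs_pool,
      no longer a separately allocated pointer):

       struct zs_pool {
               /* ... */
               struct size_class *size_class[ZS_SIZE_CLASSES];
               /* ... */
       };

       /* in zs_destroy_pool(): wrong after cf8e0fed -- this kfree()s an
        * address in the middle of the pool object itself */
       kfree(pool->size_class);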
      
      Link: http://lkml.kernel.org/r/20170725062650.GA12134@bbox
      Fixes: cf8e0fed ("mm/zsmalloc: simplify zs_max_alloc_size handling")
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reported-by: Mike Galbraith <efault@gmx.de>
      Tested-by: Mike Galbraith <efault@gmx.de>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 10 Jul, 2017 2 commits
  5. 14 Apr, 2017 1 commit
  6. 02 Mar, 2017 1 commit
  7. 25 Feb, 2017 2 commits
  8. 23 Feb, 2017 1 commit
  9. 01 Dec, 2016 1 commit
  10. 28 Jul, 2016 8 commits
  11. 26 Jul, 2016 11 commits
  12. 26 May, 2016 1 commit
  13. 21 May, 2016 6 commits
  14. 10 May, 2016 1 commit
    • zsmalloc: fix zs_can_compact() integer overflow · 44f43e99
      Sergey Senozhatsky authored
      zs_can_compact() has two race conditions in its core calculation:
      
       unsigned long obj_wasted = zs_stat_get(class, OBJ_ALLOCATED) -
                                  zs_stat_get(class, OBJ_USED);
      
      1) classes are not locked, so the numbers of allocated and used
         objects can change under us due to concurrent ops on other CPUs
      2) the shrinker invokes it from preemptible context
      
      Thus, depending on the circumstances, OBJ_ALLOCATED can become less
      than OBJ_USED, and the unsigned subtraction wraps around, which can
      result in either a very high or a negative `total_scan' value
      calculated later in do_shrink_slab().
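
      The root cause is plain unsigned underflow; a minimal user-space
      illustration (not kernel code):

       #include <stdio.h>

       int main(void)
       {
               /* a racy snapshot in which "used" overtook "allocated" */
               unsigned long allocated = 100, used = 101;

               /* wraps around: prints 18446744073709551615 on x86_64 */
               printf("%lu\n", allocated - used);
               return 0;
       }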
      
      do_shrink_slab() has some logic to prevent those cases:
      
       vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62
       vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62
       vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-64
       vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62
       vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62
       vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62
      
      However, due to the way `total_scan' is calculated, not every
      shrinker->count_objects() overflow can be spotted and handled.
      To demonstrate the latter, I added some debugging code to do_shrink_slab()
      (x86_64) and the results were:
      
       vmscan: OVERFLOW: shrinker->count_objects() == -1 [18446744073709551615]
       vmscan: but total_scan > 0: 92679974445502
       vmscan: resulting total_scan: 92679974445502
      [..]
       vmscan: OVERFLOW: shrinker->count_objects() == -1 [18446744073709551615]
       vmscan: but total_scan > 0: 22634041808232578
       vmscan: resulting total_scan: 22634041808232578
      
      Even though shrinker->count_objects() has returned an overflowed value,
      the resulting `total_scan' is positive and, what is more worrisome,
      insanely huge.  This value is then used in the
      shrinker->scan_objects() loop:
      
              while (total_scan >= batch_size ||
                     total_scan >= freeable) {
                      unsigned long ret;
                      unsigned long nr_to_scan = min(batch_size, total_scan);
      
                      shrinkctl->nr_to_scan = nr_to_scan;
                      ret = shrinker->scan_objects(shrinker, shrinkctl);
                      if (ret == SHRINK_STOP)
                              break;
                      freed += ret;
      
                      count_vm_events(SLABS_SCANNED, nr_to_scan);
                      total_scan -= nr_to_scan;
      
                      cond_resched();
              }
      
      `total_scan >= batch_size' stays true for a very long time, and
      `total_scan >= freeable' is also true for quite some time, because
      `freeable < 0' and `total_scan' is large enough (for example,
      22634041808232578).  The only remaining break condition is the
      shrinker->scan_objects() == SHRINK_STOP test, which is a bit too weak
      to rely on, especially in heavy zsmalloc-usage scenarios.
      
      To fix the issue, take a pool stat snapshot and use it instead of
      racy zs_stat_get() calls.
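
      A sketch of the fixed zs_can_compact(), close to the actual patch
      (helper names as in mm/zsmalloc.c of that era):

       static unsigned long zs_can_compact(struct size_class *class)
       {
               unsigned long obj_wasted;
               /* snapshot both counters once */
               unsigned long obj_allocated = zs_stat_get(class, OBJ_ALLOCATED);
               unsigned long obj_used = zs_stat_get(class, OBJ_USED);

               /* a racy snapshot can make "used" exceed "allocated";
                * report nothing to compact rather than let it wrap */
               if (obj_allocated <= obj_used)
                       return 0;

               obj_wasted = obj_allocated - obj_used;
               obj_wasted /= get_maxobj_per_zspage(class->size,
                               class->pages_per_zspage);

               return obj_wasted * class->pages_per_zspage;
       }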
      
      Link: http://lkml.kernel.org/r/20160509140052.3389-1-sergey.senozhatsky@gmail.com
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: <stable@vger.kernel.org>        [4.3+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  15. 17 Mar, 2016 1 commit
    • mm/zsmalloc: add `freeable' column to pool stat · 1120ed54
      Sergey Senozhatsky authored
      Add a new column to pool stats that tells how many pages can ideally
      be freed by class compaction, so it is easier to analyze zsmalloc
      fragmentation.
      
      At the moment we only have the numbers of FULL and ALMOST_EMPTY
      classes, but they don't tell us how badly a class is fragmented
      internally.
      
      The new /sys/kernel/debug/zsmalloc/zramX/classes output looks as follows:
      
       class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
      [..]
          12   224           0            2           146          5          8                4        4
          13   240           0            0             0          0          0                1        0
          14   256           1           13          1840       1672        115                1       10
          15   272           0            0             0          0          0                1        0
      [..]
          49   816           0            3           745        735        149                1        2
          51   848           3            4           361        306         76                4        8
          52   864          12           14           378        268         81                3       21
          54   896           1           12           117         57         26                2       12
          57   944           0            0             0          0          0                3        0
      [..]
       Total                26          131         12709      10994       1071                       134
      
      For example, from this particular output we can easily conclude that
      class-896 is heavily fragmented -- it occupies 26 pages, 12 of which
      can be freed by compaction.
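
      As a sanity check on how the column is computed (inferred from the
      numbers above, assuming freeable = (obj_allocated - obj_used) /
      objs_per_zspage * pages_per_zspage): class-864 has 378 - 268 = 110
      wasted objects; a 3-page zspage of 864-byte objects holds
      3 * 4096 / 864 = 14 objects, so 110 / 14 = 7 whole zspages could be
      compacted away, i.e. 7 * 3 = 21 pages -- matching the class-864 row's
      freeable value.
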
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>