1. 09 Sep, 2015 3 commits
  2. 03 Sep, 2015 1 commit
    • test: add fence-image-self-test · 07006853
      Pekka Paalanen authored
      
      
      Tests that fence_malloc and fence_image_create_bits actually work: that
      out-of-bounds and out-of-row (unused stride area) accesses trigger
      SIGSEGV.
      
      If fence_malloc is a dummy (FENCE_MALLOC_ACTIVE not defined), this test
      is skipped.
      
      Changes in v2:
      
      - check FENCE_MALLOC_ACTIVE value, not whether it is defined
      - test that reading bytes near the fence pages does not cause a
        segmentation fault
      
      Changes in v3:
      
      - Do not print progress messages unless VERBOSE environment variable is
        set. Avoid spamming the terminal output of 'make check' on some
        versions of autotools.
      Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
      Reviewed-by: Ben Avison <bavison@riscosopen.org>
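The property this test checks can be sketched outside pixman. The following is a minimal illustration, assuming a POSIX system with `mmap`, `mprotect` and `fork`; `guarded_alloc` and `child_faults` are hypothetical names, not pixman's `fence_malloc` API. It places a buffer directly in front of an inaccessible page and uses a child process to observe whether a read is fatal.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not pixman's fence_malloc): return `len` usable
 * bytes that end exactly where an inaccessible guard page begins. */
static uint8_t *guarded_alloc(size_t len)
{
    assert(len > 0);
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t data = (len + page - 1) / page * page;   /* round up to whole pages */
    uint8_t *base = mmap(NULL, data + page, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;
    /* Turn the trailing page into a fence: any access to it faults. */
    if (mprotect(base + data, page, PROT_NONE) != 0) {
        munmap(base, data + page);
        return NULL;
    }
    return base + data - len;
}

/* Read one byte at `p` in a child process. Returns 1 if the child was
 * killed by SIGSEGV (or SIGBUS, which some systems deliver instead),
 * 0 if it exited normally, -1 if fork or waitpid failed. */
static int child_faults(const volatile uint8_t *p)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        volatile uint8_t sink = *p;   /* the access under test */
        (void)sink;
        _exit(0);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) &&
           (WTERMSIG(status) == SIGSEGV || WTERMSIG(status) == SIGBUS);
}
```

With a 100-byte allocation, reading the last valid byte (`buf + 99`) should leave the child alive, while reading `buf + 100` lands on the fence page and should kill it. This mirrors the two checks added in v2 of the patch.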
  3. 01 Sep, 2015 2 commits
  4. 28 Aug, 2015 1 commit
  5. 18 Aug, 2015 1 commit
  6. 01 Aug, 2015 2 commits
  7. 16 Jul, 2015 11 commits
    • vmx: implement fast path iterator vmx_fetch_a8 · 8d9be361
      Oded Gabbay authored
      No changes were observed when running the cairo trimmed benchmarks.
      
      Running "lowlevel-blt-bench src_8_8888" on POWER8, 8 cores,
      3.4GHz, RHEL 7.1 ppc64le gave the following results:
      
      reference memcpy speed = 25197.2MB/s (6299.3MP/s for 32bpp fills)
      
                      Before          After           Change
                    --------------------------------------------
      L1              965.34          3936           +307.73%
      L2              942.99          3436.29        +264.40%
      M               902.24          2757.77        +205.66%
      HT              448.46          784.99         +75.04%
      VT              430.05          819.78         +90.62%
      R               412.9           717.04         +73.66%
      RT              168.93          220.63         +30.60%
      Kops/s          1025            1303           +27.12%
      
      It was benchmarked against commit id e2d211ac from pixman/master.
      
      Siarhei Siamashka reported that on a PlayStation 3 it shows the
      following results:
      
      == before ==
      
                    src_8_8888 =  L1: 194.37  L2: 198.46  M:155.90 (148.35%)
                    HT: 59.18  VT: 36.71  R: 38.93  RT: 12.79 ( 106Kops/s)
      
      == after ==
      
                    src_8_8888 =  L1: 373.96  L2: 391.10  M:245.81 (233.88%)
                    HT: 80.81  VT: 44.33  R: 48.10  RT: 14.79 ( 122Kops/s)
      Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
      Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
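The "Change" column in these tables is consistent with the relative change (after - before) / before, expressed in percent. A small sketch for re-deriving it; `percent_change` is a hypothetical helper and not part of lowlevel-blt-bench.

```c
#include <assert.h>

/* Relative change from `before` to `after`, in percent.
 * Hypothetical helper for reading the tables above. */
static double percent_change(double before, double after)
{
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}
```

For the L1 row above, `percent_change(965.34, 3936.0)` comes out at roughly 307.73, matching the reported value.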
    • vmx: implement fast path iterator vmx_fetch_x8r8g8b8 · 47f74ca9
      Oded Gabbay authored
      It was benchmarked against commit id 2be523b2 from pixman/master.
      
      POWER8, 8 cores, 3.4GHz, RHEL 7.1 ppc64le.
      
      cairo trimmed benchmarks :
      
      Speedups
      ========
      t-firefox-asteroids  533.92  -> 489.94 :  1.09x
      Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
      Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
    • vmx: implement fast path scaled nearest vmx_8888_8888_OVER · fcbb97d4
      Oded Gabbay authored
      It was benchmarked against commit id 2be523b2 from pixman/master.
      
      POWER8, 8 cores, 3.4GHz, RHEL 7.1 ppc64le.
      reference memcpy speed = 24764.8MB/s (6191.2MP/s for 32bpp fills)
      
                      Before           After           Change
                    ---------------------------------------------
      L1              134.36          181.68          +35.22%
      L2              135.07          180.67          +33.76%
      M               134.6           180.51          +34.11%
      HT              121.77          128.79          +5.76%
      VT              120.49          145.07          +20.40%
      R               93.83           102.3           +9.03%
      RT              50.82           46.93           -7.65%
      Kops/s          448             422             -5.80%
      
      cairo trimmed benchmarks :
      
      Speedups
      ========
      t-firefox-asteroids  533.92 -> 497.92 :  1.07x
          t-midori-zoomed  692.98 -> 651.24 :  1.06x
      Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
      Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
    • vmx: implement fast path vmx_composite_src_x888_8888 · ad612c42
      Oded Gabbay authored
      It was benchmarked against commit id 2be523b2 from pixman/master.
      
      POWER8, 8 cores, 3.4GHz, RHEL 7.1 ppc64le.
      reference memcpy speed = 24764.8MB/s (6191.2MP/s for 32bpp fills)
      
                      Before           After           Change
                    ---------------------------------------------
      L1              1115.4          5006.49         +348.85%
      L2              1112.26         4338.01         +290.02%
      M               1110.54         2524.15         +127.29%
      HT              745.41          1140.03         +52.94%
      VT              749.03          1287.13         +71.84%
      R               423.91          547.6           +29.18%
      RT              205.79          194.98          -5.25%
      Kops/s          1414            1361            -3.75%
      
      cairo trimmed benchmarks :
      
      Speedups
      ========
      t-gnome-system-monitor  1402.62  -> 1212.75 :  1.16x
         t-firefox-asteroids   533.92  ->  474.50 :  1.13x
      Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
      Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
    • vmx: implement fast path vmx_composite_over_n_8888_8888_ca · fafc1d40
      Oded Gabbay authored
      It was benchmarked against commit id 2be523b2 from pixman/master.
      
      POWER8, 8 cores, 3.4GHz, RHEL 7.1 ppc64le.
      
      reference memcpy speed = 24764.8MB/s (6191.2MP/s for 32bpp fills)
      
                      Before           After           Change
                    ---------------------------------------------
      L1              61.92            244.91          +295.53%
      L2              62.74            243.3           +287.79%
      M               63.03            241.94          +283.85%
      HT              59.91            144.22          +140.73%
      VT              59.4             174.39          +193.59%
      R               53.6             111.37          +107.78%
      RT              37.99            46.38           +22.08%
      Kops/s          436              506             +16.06%
      
      cairo trimmed benchmarks :
      
      Speedups
      ========
      t-xfce4-terminal-a1  1540.37 -> 1226.14 :  1.26x
      t-firefox-talos-gfx  1488.59 -> 1209.19 :  1.23x
      
      Slowdowns
      =========
              t-evolution  553.88  -> 581.63  :  1.05x
                t-poppler  364.99  -> 383.79  :  1.05x
      t-firefox-scrolling  1223.65 -> 1304.34 :  1.07x
      
      The slowdowns can be explained by cases where the images are small and
      not aligned to a 16-byte boundary. In that case, the function first
      processes the unaligned area, even in operations of 1 byte. For small
      images, the overhead of such operations can exceed the savings from
      the vmx instructions that process the aligned part of the image.
      
      The C fast-path implementation has no special treatment for the
      unaligned part, as it works in 4-byte quantities on the entire image.
      
      Because llbb is a synthetic test, I would assume it has far fewer
      alignment issues than a "real-world" scenario, such as the cairo
      benchmarks, which are basically recorded traces of real application
      activity.
      Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
      Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
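The alignment argument in this message can be made concrete with a sketch of how a row is divided before a vector loop runs. This is an illustration only, assuming 32-bit pixels and 16-byte vectors; `split_t` and `split_row` are hypothetical names and do not appear in pixman.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pixel counts for the three phases of a row (hypothetical type). */
typedef struct { size_t head, body, tail; } split_t;

/* Split a row of `width` 32-bit pixels starting at `dst` into:
 *   head: single pixels handled until dst reaches a 16-byte boundary,
 *   body: whole 4-pixel groups, the part a vector loop would handle,
 *   tail: leftover pixels after the last whole group. */
static split_t split_row(const uint32_t *dst, size_t width)
{
    assert(dst != NULL);
    split_t s = { 0, 0, 0 };
    while (width > 0 && ((uintptr_t)(dst + s.head) & 15) != 0) {
        s.head++;
        width--;
    }
    s.body = width & ~(size_t)3;
    s.tail = width - s.body;
    return s;
}
```

For a row that starts 4 bytes past a 16-byte boundary and is 6 pixels wide, the split is head 3, body 0, tail 3. Every pixel takes the scalar path and the vector loop never runs, which is the small-image case described above.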
    • vmx: implement fast path composite_add_8888_8888 · a3e91440
      Oded Gabbay authored
      Copied impl. from sse2 file and edited to use vmx functions
      
      It was benchmarked against commit id 2be523b2 from pixman/master.
      
      POWER8, 16 cores, 3.4GHz, ppc64le :
      
      reference memcpy speed = 27036.4MB/s (6759.1MP/s for 32bpp fills)
      
                      Before           After           Change
                    ---------------------------------------------
      L1              248.76          3284.48         +1220.34%
      L2              264.09          2826.47         +970.27%
      M               261.24          2405.06         +820.63%
      HT              217.27          857.3           +294.58%
      VT              213.78          980.09          +358.46%
      R               176.61          442.95          +150.81%
      RT              107.54          150.08          +39.56%
      Kops/s          917             1125            +22.68%
      Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
      Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
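For readers unfamiliar with the ADD operator, here is a scalar model of what an add fast path computes per pixel: each 8-bit channel is summed and clamped at 255. This is a sketch of the arithmetic only, not the vmx code from this commit; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of ADD on one 8-bit channel: the sum clamps at 255. */
static uint8_t add_sat_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + b;
    return (uint8_t)(sum > 255 ? 255 : sum);
}

/* Scalar model of ADD on one 32-bit pixel with four 8-bit channels:
 * every channel saturates independently of its neighbours. */
static uint32_t add_8888(uint32_t src, uint32_t dst)
{
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint8_t s = (uint8_t)(src >> shift);
        uint8_t d = (uint8_t)(dst >> shift);
        out |= (uint32_t)add_sat_u8(s, d) << shift;
    }
    return out;
}

/* Quick self-check of the saturation behaviour. */
static void add_self_check(void)
{
    assert(add_sat_u8(200, 100) == 255);
    assert(add_8888(0x80FF1020u, 0x90010203u) == 0xFFFF1223u);
}
```

A vector implementation performs the same per-byte saturating add on 16 bytes at a time, which is where the large L1/L2 gains in the table come from.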
    • vmx: implement fast path composite_add_8_8 · d5b5343c
      Oded Gabbay authored
      Copied impl. from sse2 file and edited to use vmx functions
      
      It was benchmarked against commit id 2be523b2 from pixman/master.
      
      POWER8, 16 cores, 3.4GHz, ppc64le :
      
      reference memcpy speed = 27036.4MB/s (6759.1MP/s for 32bpp fills)
      
                      Before           After           Change
                    ---------------------------------------------
      L1              687.63          9140.84         +1229.33%
      L2              715             7495.78         +948.36%
      M               717.39          8460.14         +1079.29%
      HT              569.56          1020.12         +79.11%
      VT              520.3           1215.56         +133.63%
      R               514.81          874.35          +69.84%
      RT              341.28          305.42          -10.51%
      Kops/s          1621            1579            -2.59%
      Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
      Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
    • vmx: implement fast path composite_over_8888_8888 · 339eeaf0
      Oded Gabbay authored
      Copied impl. from sse2 file and edited to use vmx functions
      
      It was benchmarked against commit id 2be523b2 from pixman/master.
      
      POWER8, 16 cores, 3.4GHz, ppc64le :
      
      reference memcpy speed = 27036.4MB/s (6759.1MP/s for 32bpp fills)
      
                      Before           After           Change
                    ---------------------------------------------
      L1              129.47          1054.62         +714.57%
      L2              138.31          1011.02         +630.98%
      M               139.99          1008.65         +620.52%
      HT              122.11          468.45          +283.63%
      VT              121.06          532.21          +339.62%
      R               108.48          240.5           +121.70%
      RT              77.87           116.7           +49.87%
      Kops/s          758             981             +29.42%
      Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
      Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
    • vmx: implement fast path vmx_fill · 0cc8a2e9
      Oded Gabbay authored
      Based on sse2 impl.
      
      It was benchmarked against commit id e2d211ac from pixman/master.
      
      Tested cairo trimmed benchmarks on POWER8, 8 cores, 3.4GHz,
      RHEL 7.1 ppc64le :
      
      speedups
      ========
           t-swfdec-giant-steps  1383.09 ->  718.63  :  1.92x speedup
         t-gnome-system-monitor  1403.53 ->  918.77  :  1.53x speedup
                    t-evolution  552.34  ->  415.24  :  1.33x speedup
            t-xfce4-terminal-a1  1573.97 ->  1351.46 :  1.16x speedup
            t-firefox-paintball  847.87  ->  734.50  :  1.15x speedup
            t-firefox-asteroids  565.99  ->  492.77  :  1.15x speedup
      t-firefox-canvas-swscroll  1656.87 ->  1447.48 :  1.14x speedup
                t-midori-zoomed  724.73  ->  642.16  :  1.13x speedup
         t-firefox-planet-gnome  975.78  ->  911.92  :  1.07x speedup
                t-chromium-tabs  292.12  ->  274.74  :  1.06x speedup
           t-firefox-chalkboard  690.78  ->  653.93  :  1.06x speedup
            t-firefox-talos-gfx  1375.30 ->  1303.74 :  1.05x speedup
         t-firefox-canvas-alpha  1016.79 ->  967.24  :  1.05x speedup
      Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
      Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
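The "x speedup" figures in the cairo listing are consistent with the old measurement divided by the new one, where lower measurements are better. A minimal sketch; `speedup` is a hypothetical helper, not part of the cairo benchmark tooling.

```c
#include <assert.h>

/* Speedup factor: old measurement divided by new measurement.
 * Values above 1.0 mean the new code is faster. */
static double speedup(double before, double after)
{
    assert(after > 0.0);
    return before / after;
}
```

For t-swfdec-giant-steps, `speedup(1383.09, 718.63)` is about 1.92, as listed. The "Slowdowns" sections in other commits report the inverse ratio (new divided by old).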
    • vmx: add helper functions · c12ee950
      Oded Gabbay authored
      
      
      This patch adds the following helper functions to enable code reuse,
      hide BE/LE differences and improve maintainability.
      
      All of the functions were defined as static force_inline.
      
      Names were copied from pixman-sse2.c so conversion of fast-paths between
      sse2 and vmx would be easier from now on. Therefore, I tried to keep the
      input/output of the functions to be as close as possible to the sse2
      definitions.
      
      The functions are:
      
      - load_128_aligned       : load 128-bit from a 16-byte aligned memory
                                 address into a vector
      
      - load_128_unaligned     : load 128-bit from memory into a vector,
                                 without guarantee of alignment for the
                                 source pointer
      
      - save_128_aligned       : save 128-bit vector into a 16-byte aligned
                                 memory address
      
      - create_mask_16_128     : take a 16-bit value and fill with it
                                 a new vector
      
      - create_mask_1x32_128   : take a 32-bit pointer and fill a new
                                 vector with the 32-bit value from that pointer
      
      - create_mask_32_128     : take a 32-bit value and fill with it
                                 a new vector
      
      - unpack_32_1x128        : unpack 32-bit value into a vector
      
      - unpacklo_128_16x8      : unpack the eight low 8-bit values of a vector
      
      - unpackhi_128_16x8      : unpack the eight high 8-bit values of a vector
      
      - unpacklo_128_8x16      : unpack the four low 16-bit values of a vector
      
      - unpackhi_128_8x16      : unpack the four high 16-bit values of a vector
      
      - unpack_128_2x128       : unpack the eight low 8-bit values of a vector
                                 into one vector and the eight high 8-bit
                                 values into another vector
      
      - unpack_128_2x128_16    : unpack the four low 16-bit values of a vector
                                 into one vector and the four high 16-bit
                                 values into another vector
      
      - unpack_565_to_8888     : unpack an RGB_565 vector to 8888 vector
      
      - pack_1x128_32          : pack a vector and return the LSB 32-bit of it
      
      - pack_2x128_128         : pack two vectors into one and return it
      
      - negate_2x128           : xor two vectors with mask_00ff (separately)
      
      - is_opaque              : returns whether all the pixels contained in
                                 the vector are opaque
      
      - is_zero                : returns whether the vector equals 0
      
      - is_transparent         : returns whether all the pixels
                                 contained in the vector are transparent
      
      - expand_pixel_8_1x128   : expand an 8-bit pixel into lower 8 bytes of a
                                 vector
      
      - expand_alpha_1x128     : expand alpha from vector and return the new
                                 vector
      
      - expand_alpha_2x128     : expand alpha from one vector and another alpha
                                 from a second vector
      
      - expand_alpha_rev_2x128 : expand a reversed alpha from one vector and
                                 another reversed alpha from a second vector
      
      - pix_multiply_2x128     : do pix_multiply for two vectors (separately)
      
      - over_2x128             : perform over op. on two vectors
      
      - in_over_2x128          : perform in-over op. on two vectors
      
      v2: removed expand_pixel_32_1x128 as it was not used by any function and
      its implementation was erroneous
      Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
      Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
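To show what the arithmetic helpers in this list compute, here are scalar, one-channel counterparts of pix_multiply, negate and over. These are models for illustration, assuming premultiplied pixels with alpha in the top byte; the `_u8` and `_8888` names are hypothetical, and the real helpers operate on 128-bit vectors.

```c
#include <assert.h>
#include <stdint.h>

/* Rounded (a * b) / 255 for 8-bit channels, computed without a divide. */
static uint8_t pix_multiply_u8(uint8_t a, uint8_t b)
{
    unsigned t = (unsigned)a * b + 0x80;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* 255 - x, which is what xor with 0xff does to an 8-bit channel. */
static uint8_t negate_u8(uint8_t x)
{
    return (uint8_t)(x ^ 0xff);
}

/* OVER for one premultiplied pixel, per channel:
 *   result = src + dst * (255 - src_alpha) / 255, clamped at 255. */
static uint32_t over_8888(uint32_t src, uint32_t dst)
{
    uint8_t inv_alpha = negate_u8((uint8_t)(src >> 24));
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        unsigned s = (src >> shift) & 0xff;
        unsigned d = pix_multiply_u8((uint8_t)(dst >> shift), inv_alpha);
        unsigned sum = s + d;
        out |= (uint32_t)(sum > 255 ? 255 : sum) << shift;
    }
    return out;
}

/* Identities any OVER implementation should satisfy. */
static void over_self_check(void)
{
    /* An opaque source replaces the destination. */
    assert(over_8888(0xFF112233u, 0x80445566u) == 0xFF112233u);
    /* A fully transparent source leaves the destination unchanged. */
    assert(over_8888(0x00000000u, 0x80445566u) == 0x80445566u);
}
```

The two identities in `over_self_check` also suggest why helpers such as is_opaque and is_transparent are useful: a fast path can skip the multiply when a whole vector of source pixels falls into one of these cases.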
    • vmx: add LOAD_VECTOR macro · 03414953
      Oded Gabbay authored
      
      
      This patch adds a macro for loading a single vector.
      It also makes the other LOAD_VECTORx macros use this macro as a base so
      code is reused.
      
      In addition, I fixed minor coding style issues.
      Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
      Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
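The layering this message describes, a single-vector load macro that the multi-vector macros build on, can be sketched in portable C. The macro names `LOAD_VEC`, `LOAD_VEC2` and `LOAD_VEC3`, the `vec128` type and `add_row4` are hypothetical stand-ins and are not the definitions in pixman-vmx.c.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for a 128-bit vector register; real vmx code uses a vector type. */
typedef struct { uint32_t lane[4]; } vec128;

/* Base macro: load 16 bytes from pointer `src` into the variable v<src>. */
#define LOAD_VEC(src) \
    memcpy(&v ## src, (src), sizeof(vec128))

/* Multi-vector variants are written in terms of the base macro. */
#define LOAD_VEC2(dest, src) \
    do { LOAD_VEC(src); LOAD_VEC(dest); } while (0)

#define LOAD_VEC3(dest, src, mask) \
    do { LOAD_VEC2(dest, src); LOAD_VEC(mask); } while (0)

/* Example use: add four 32-bit lanes of src into dest (wrapping add). */
static void add_row4(uint32_t *dest, const uint32_t *src)
{
    assert(dest != NULL && src != NULL);
    vec128 vdest, vsrc;
    LOAD_VEC2(dest, src);
    for (int i = 0; i < 4; i++)
        vdest.lane[i] += vsrc.lane[i];
    memcpy(dest, &vdest, sizeof(vdest));
}
```

With this arrangement, a change to how a single vector is loaded, for example an endianness or alignment fix, only has to be made in the base macro.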
  8. 11 Jul, 2015 1 commit
  9. 06 Jul, 2015 9 commits
  10. 02 Jul, 2015 5 commits
  11. 01 Jun, 2015 4 commits