  1. Mar 29, 2016
    • Always use tables with function pointers for bandwidth benchmarks · 05d03c7b
      Siarhei Siamashka authored
      This simplifies the code a lot and makes it possible to use
      different tables for different use cases.
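      
      The commit message does not include the table itself, but a minimal
      sketch of such a function pointer table in C could look like the
      following (the bench_info layout and names here are assumptions for
      illustration, not the project's exact definitions):
      
        #include <stdint.h>
        
        typedef void (*bench_func)(int64_t *dst, int64_t *src, int size);
        
        typedef struct
        {
            const char *description;  /* label printed in the report */
            int         use_tmpbuf;   /* whether a bounce buffer is needed */
            bench_func  f;            /* the routine being measured */
        } bench_info;
        
        void aligned_block_copy(int64_t *dst, int64_t *src, int size);
        void aligned_block_fill(int64_t *dst, int64_t *src, int size);
        
        static bench_info c_benchmarks[] =
        {
            { "C copy", 0, aligned_block_copy },
            { "C fill", 0, aligned_block_fill },
            { NULL, 0, NULL }  /* terminator */
        };
      
      The driver can then just walk a table until it hits the NULL
      terminator, so supporting a different use case only means defining
      another table.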
    • New variants of block based C backwards copy · eb1fccd5
      Siarhei Siamashka authored
      Because some processors are sensitive to the order of memory
      accesses, add a few more variants of backwards memory buffer
      copy which do sequential memory writes in the forward direction
      inside each sub-block of a certain size. The most interesting
      sub-block sizes are 32 and 64 bytes, because they match the
      most frequently used CPU cache line sizes.
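      
      A minimal sketch of one such variant (hypothetical code assuming a
      buffer size that is a multiple of the sub-block size, not the exact
      code from this commit):
      
        #include <stdint.h>
        
        #define BLOCK_SIZE 64  /* 32 is the other interesting size */
        
        void aligned_block_copy_backwards_bs64(int64_t *dst, int64_t *src,
                                               int size)
        {
            /* position the pointers at the last 64 byte sub-block */
            src += size / 8 - BLOCK_SIZE / 8;
            dst += size / 8 - BLOCK_SIZE / 8;
            while (size > 0)
            {
                /* sequential forward writes inside the sub-block ... */
                for (int i = 0; i < BLOCK_SIZE / 8; i++)
                    dst[i] = src[i];
                /* ... while the sub-blocks themselves walk backwards */
                src  -= BLOCK_SIZE / 8;
                dst  -= BLOCK_SIZE / 8;
                size -= BLOCK_SIZE;
            }
        }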
      
      Example reports:
      
      == ARM Cortex A7 ==
       C copy backwards                                     :    266.5 MB/s
       C copy backwards (32 byte blocks)                    :   1015.6 MB/s
       C copy backwards (64 byte blocks)                    :   1045.7 MB/s
       C copy                                               :   1033.3 MB/s
      
      == ARM Cortex A15 ==
       C copy backwards                                     :   1438.5 MB/s
       C copy backwards (32 byte blocks)                    :   1497.5 MB/s
       C copy backwards (64 byte blocks)                    :   2643.2 MB/s
       C copy                                               :   2985.8 MB/s
    • Benchmark reshuffled writes to the destination buffer · ada1db8c
      Siarhei Siamashka authored
      This is expected to test the hardware's ability to do write
      combining for scattered writes, and to detect any possible
      performance penalties.
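      
      As an illustration, a shuffled fill might look roughly like this
      (a hypothetical sketch, not the commit's exact code):
      
        #include <stdint.h>
        
        void aligned_block_fill_shuffle32(int64_t *dst, int64_t *src,
                                          int size)
        {
            int64_t data = *src;  /* fill pattern from the source buffer */
            for (int64_t *end = dst + size / 8; dst < end; dst += 4)
            {
                /* the four stores cover one 32 byte block, but land
                   in a scattered order instead of sequentially */
                dst[2] = data;
                dst[0] = data;
                dst[3] = data;
                dst[1] = data;
            }
        }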
      
      Example reports:
      
      == ARM Cortex A7 ==
       C fill                                               :   4011.5 MB/s
       C fill (shuffle within 16 byte blocks)               :   4112.2 MB/s (0.3%)
       C fill (shuffle within 32 byte blocks)               :    333.9 MB/s
       C fill (shuffle within 64 byte blocks)               :    336.6 MB/s
      
      == ARM Cortex A15 ==
       C fill                                               :   6065.2 MB/s (0.4%)
       C fill (shuffle within 16 byte blocks)               :   2152.0 MB/s
       C fill (shuffle within 32 byte blocks)               :   2150.7 MB/s
       C fill (shuffle within 64 byte blocks)               :   2238.2 MB/s
      
      == ARM Cortex A53 ==
       C fill                                               :   3080.8 MB/s (0.2%)
       C fill (shuffle within 16 byte blocks)               :   3080.7 MB/s
       C fill (shuffle within 32 byte blocks)               :   3079.2 MB/s
       C fill (shuffle within 64 byte blocks)               :   3080.4 MB/s
      
      == Intel Atom N450 ==
       C fill                                               :   1554.9 MB/s
       C fill (shuffle within 16 byte blocks)               :   1554.5 MB/s
       C fill (shuffle within 32 byte blocks)               :   1553.9 MB/s
       C fill (shuffle within 64 byte blocks)               :   1554.4 MB/s
      
      See https://github.com/ssvb/tinymembench/issues/7
    • Enforce strict order of writes in C benchmarks via volatile keyword · 6fd9baed
      Siarhei Siamashka authored
      The C compiler may attempt to reorder read and write operations when
      accessing the source and destination buffers. So instead of sequential
      memory accesses we may get something like a "drunk master style"
      memory access pattern. Certain processors, such as the ARM Cortex-A7,
      do not like such a memory access pattern very much, and it causes
      a major performance drop. The actual access pattern is unpredictable
      and is sensitive to the compiler version, optimization flags and
      sometimes even to changes in unrelated parts of the source code.
      
      So use the volatile keyword for the destination pointer in order
      to resolve this problem and make the C benchmarks more deterministic.
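      
      A minimal illustration of the idea (the assumed shape of the code,
      not the actual diff):
      
        #include <stdint.h>
        
        void aligned_block_copy(int64_t * __restrict dst_,
                                int64_t * __restrict src, int size)
        {
            /* the volatile qualifier forbids the compiler from
               reordering or combining these stores, so they are
               emitted strictly in program order */
            volatile int64_t *dst = dst_;
            while ((size -= 32) >= 0)
            {
                dst[0] = src[0];
                dst[1] = src[1];
                dst[2] = src[2];
                dst[3] = src[3];
                dst += 4;
                src += 4;
            }
        }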
      
      See https://github.com/ssvb/tinymembench/issues/7
    • Do prefetch via 64 byte steps in aligned_block_copy_pf64 · b40f1c03
      Siarhei Siamashka authored
      The old variant was just a copy of aligned_block_copy_pf32.
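      
      A sketch of the corrected behaviour (using GCC's __builtin_prefetch
      for illustration; the prefetch distance value is an assumption):
      
        #include <stdint.h>
        
        #define PREFETCH_DISTANCE 256  /* bytes; hypothetical value */
        
        void aligned_block_copy_pf64(int64_t *dst, int64_t *src, int size)
        {
            while ((size -= 64) >= 0)
            {
                /* exactly one prefetch per 64 byte chunk, unlike the
                   pf32 variant which issues one per 32 bytes */
                __builtin_prefetch((char *)src + PREFETCH_DISTANCE, 0, 0);
                for (int i = 0; i < 8; i++)
                    dst[i] = src[i];
                dst += 8;
                src += 8;
            }
        }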
  2. Mar 14, 2016
    • Add the LICENSE file · 853b0c69
      Siarhei Siamashka authored
      While every source file already had an MIT license notice, having
      a LICENSE file in the top-level directory is a good idea too.
      This resolves #5
  3. Jul 06, 2013
    • Adjust prefetches for NEON read tests · c6411701
      Siarhei Siamashka authored
      Use a fixed prefetch distance of 512 bytes for everything. Also make
      sure that a 32 byte prefetch step really means a single PLD instruction
      executed per 32 byte data chunk, and likewise for the 64 byte prefetch.
      We don't care too much about achieving peak performance; consistency
      and predictability are more important.
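      
      In C terms the intended pattern is roughly the following (the real
      read tests are NEON assembly; this sketch uses GCC's
      __builtin_prefetch as a stand-in for PLD):
      
        #include <stdint.h>
        
        #define PREFETCH_DISTANCE 512  /* fixed, in bytes */
        
        int64_t read_test_pf32(int64_t *src, int size)
        {
            int64_t acc = 0;
            while ((size -= 32) >= 0)
            {
                /* one prefetch instruction per 32 byte data chunk */
                __builtin_prefetch((char *)src + PREFETCH_DISTANCE, 0, 0);
                acc ^= src[0] ^ src[1] ^ src[2] ^ src[3];
                src += 4;
            }
            return acc;  /* returned so the loads cannot be optimized away */
        }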
    • Added a new "NEON read 2 data streams" benchmark · 84dc0e2e
      Siarhei Siamashka authored
      This benchmark exposes some problems or misconfiguration in the
      Allwinner A20 (Cortex-A7) memory subsystem. If we read from two
      separate buffers at once, the performance drops quite significantly.
      It is documented that the automatic prefetcher in the Cortex-A7 can
      only track a single data stream:
      
          http://infocenter.arm.com/help/topic/com.arm.doc.ddi0464f/CHDIGCEB.html
      
      However, there is still something wrong even when using explicit PLD
      prefetches. Here are some results:
      
       NEON read                                            :   1256.7 MB/s
       NEON read prefetched (32 bytes step)                 :   1346.4 MB/s (0.4%)
       NEON read prefetched (64 bytes step)                 :   1439.6 MB/s (0.4%)
       NEON read 2 data streams                             :    371.3 MB/s (0.3%)
       NEON read 2 data streams prefetched (32 bytes step)  :    687.5 MB/s (0.5%)
       NEON read 2 data streams prefetched (64 bytes step)  :    703.2 MB/s (0.4%)
      
      Normally we would expect the memory bandwidth to remain roughly the
      same no matter how many data streams we are reading at once. But even
      reading two data streams is enough to reveal serious problems.
      
      Being able to read efficiently from multiple data streams at once
      is important for 2D graphics (alpha blending), colorspace conversion
      (planar YUV -> packed RGB) and many other things.
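      
      The access pattern being measured is, in rough C form (the actual
      benchmark is NEON assembly; this is only a sketch):
      
        #include <stdint.h>
        
        int64_t read_2_streams(int64_t *srcA, int64_t *srcB, int size)
        {
            int64_t acc = 0;
            while ((size -= 32) >= 0)
            {
                /* interleaved loads from two separate buffers; an
                   automatic prefetcher that can track only a single
                   stream loses this pattern */
                acc ^= srcA[0] ^ srcB[0];
                acc ^= srcA[1] ^ srcB[1];
                srcA += 2;
                srcB += 2;
            }
            return acc;
        }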