- Feb 13, 2017
-
Siarhei Siamashka authored
-
Siarhei Siamashka authored
Unlike on Win32, it is not necessary to have underscores before function names. Calling conventions are also different.
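For reference, a common way to keep one assembly source portable across these targets is a small symbol-name macro; the sketch below is only an illustration and the macro name is hypothetical, not taken from this repository:

    /* Hypothetical helper macro: Win32 cdecl prepends an underscore to C
     * symbol names, while ELF (Linux, etc.) targets use the plain name. */
    #ifdef _WIN32
    #define ASM_SYMBOL(name) _##name
    #else
    #define ASM_SYMBOL(name) name
    #endif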
-
Siarhei Siamashka authored
This can speed up the tests on travis-ci, and we don't care about measurement accuracy there.
-
- Feb 09, 2017
-
Siarhei Siamashka authored
-
- Apr 19, 2016
-
Siarhei Siamashka authored
Add missing include for ioctl() on Linux
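On Linux the ioctl() prototype lives in <sys/ioctl.h>, so the missing include is presumably:

    #include <sys/ioctl.h>   /* declares ioctl() on Linux */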
-
- Apr 18, 2016
-
Jan Chren authored
-
- Apr 02, 2016
-
Siarhei Siamashka authored
-
- Apr 01, 2016
-
Siarhei Siamashka authored
-
- Mar 31, 2016
-
Siarhei Siamashka authored
-
- Mar 30, 2016
-
Siarhei Siamashka authored
-
Siarhei Siamashka authored
-
Siarhei Siamashka authored
-
Siarhei Siamashka authored
There is no need for the -DBENCH_FRAMEBUFFER hack anymore.
-
Siarhei Siamashka authored
Keep doubling the number of loop iterations until the duration of a test run exceeds 0.5s, and start from 1 loop iteration instead of 16. If the memory bandwidth is extremely low (for example, when running the tests on the framebuffer), this keeps the test duration reasonable. And if the memory bandwidth is extremely high, this approach reduces the overhead of the periodic gettimeofday() calls.
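A minimal C sketch of this calibration loop (an illustration of the idea, not the actual tinymembench code; the function names are made up):

    #include <stdint.h>
    #include <sys/time.h>

    static double gettime(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1000000.0;
    }

    /* Run 'func' over the buffers, doubling the iteration count until a
     * single timed run takes longer than 0.5 seconds. */
    static int calibrate_loops(void (*func)(int64_t *, int64_t *, int),
                               int64_t *dst, int64_t *src, int size)
    {
        int loops = 1;                       /* start from 1 instead of 16 */
        for (;;)
        {
            double t1 = gettime();
            for (int i = 0; i < loops; i++)
                func(dst, src, size);
            double t2 = gettime();
            if (t2 - t1 > 0.5)               /* long enough to measure */
                return loops;
            loops *= 2;                      /* keep doubling */
        }
    }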
-
- Mar 29, 2016
-
Siarhei Siamashka authored
This simplifies the code a lot and makes it possible to use different tables for different use cases.
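Presumably each test becomes an entry in a table of {description, function} pairs, and different tables can then be handed to a common runner. The names below are illustrative only, not necessarily the ones used in the source tree:

    #include <stdint.h>

    typedef void (*bench_func_t)(int64_t *dst, int64_t *src, int size);

    typedef struct
    {
        const char   *description;   /* label printed in the report */
        bench_func_t  func;          /* routine being benchmarked   */
    } bench_entry_t;

    /* Copy routines defined elsewhere (names shown only as an example). */
    extern void aligned_block_copy(int64_t *dst, int64_t *src, int size);
    extern void aligned_block_copy_backwards(int64_t *dst, int64_t *src, int size);

    /* Separate tables can be defined for different use cases, e.g. the
     * normal run vs. the framebuffer tests. */
    static const bench_entry_t c_benchmarks[] =
    {
        { "C copy",           aligned_block_copy },
        { "C copy backwards", aligned_block_copy_backwards },
        { NULL, NULL }                        /* table terminator */
    };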
-
Siarhei Siamashka authored
Because some processors are sensitive to the order of memory accesses, add a few more variants of memory buffer backwards copy, which do sequential memory writes in the forward direction inside each sub-block of a certain size. The most interesting sub-block sizes are 32 and 64 bytes, because they match the most frequently used CPU cache line sizes.

Example reports:

== ARM Cortex A7 ==
 C copy backwards                   :   266.5 MB/s
 C copy backwards (32 byte blocks)  :  1015.6 MB/s
 C copy backwards (64 byte blocks)  :  1045.7 MB/s
 C copy                             :  1033.3 MB/s

== ARM Cortex A15 ==
 C copy backwards                   :  1438.5 MB/s
 C copy backwards (32 byte blocks)  :  1497.5 MB/s
 C copy backwards (64 byte blocks)  :  2643.2 MB/s
 C copy                             :  2985.8 MB/s
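A simplified C sketch of the 64 byte block variant (an illustration of the access pattern, not the exact routine from the benchmark):

    #include <stdint.h>

    /* Walk the buffers backwards in 64-byte blocks, but copy forwards
     * within each block, so the writes inside a cache line still happen
     * in ascending address order. */
    static void block_copy_backwards_bs64(int64_t *dst, int64_t *src, int size)
    {
        int blocks = size / 64;
        dst += blocks * 8;                   /* 8 x int64_t = 64 bytes */
        src += blocks * 8;
        while (blocks-- > 0)
        {
            dst -= 8;
            src -= 8;
            for (int i = 0; i < 8; i++)      /* forward within the block */
                dst[i] = src[i];
        }
    }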
-
Siarhei Siamashka authored
This is expected to test the ability to do write combining for scattered writes and detect any possible performance penalties.

Example reports:

== ARM Cortex A7 ==
 C fill                                  :  4011.5 MB/s
 C fill (shuffle within 16 byte blocks)  :  4112.2 MB/s (0.3%)
 C fill (shuffle within 32 byte blocks)  :   333.9 MB/s
 C fill (shuffle within 64 byte blocks)  :   336.6 MB/s

== ARM Cortex A15 ==
 C fill                                  :  6065.2 MB/s (0.4%)
 C fill (shuffle within 16 byte blocks)  :  2152.0 MB/s
 C fill (shuffle within 32 byte blocks)  :  2150.7 MB/s
 C fill (shuffle within 64 byte blocks)  :  2238.2 MB/s

== ARM Cortex A53 ==
 C fill                                  :  3080.8 MB/s (0.2%)
 C fill (shuffle within 16 byte blocks)  :  3080.7 MB/s
 C fill (shuffle within 32 byte blocks)  :  3079.2 MB/s
 C fill (shuffle within 64 byte blocks)  :  3080.4 MB/s

== Intel Atom N450 ==
 C fill                                  :  1554.9 MB/s
 C fill (shuffle within 16 byte blocks)  :  1554.5 MB/s
 C fill (shuffle within 32 byte blocks)  :  1553.9 MB/s
 C fill (shuffle within 64 byte blocks)  :  1554.4 MB/s

See https://github.com/ssvb/tinymembench/issues/7
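Roughly, the shuffled fill permutes the write order inside each block, e.g. for 64 byte blocks (the permutation below is just one possible choice, not necessarily the one used):

    #include <stdint.h>

    /* Fill the buffer block by block, scattering the 8 writes inside
     * each 64-byte block instead of writing them in ascending order.
     * This probes whether write combining copes with out-of-order
     * writes within a cache line. */
    static void fill_shuffle_bs64(int64_t *dst, int64_t value, int size)
    {
        static const int order[8] = { 5, 2, 7, 0, 3, 6, 1, 4 };
        for (int block = 0; block < size / 64; block++)
        {
            for (int i = 0; i < 8; i++)
                dst[block * 8 + order[i]] = value;
        }
    }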
-
Siarhei Siamashka authored
The C compiler may attempt to reorder read and write operations when accessing the source and destination buffers. So instead of sequential memory accesses we may get something like a "drunk master style" memory access pattern. Certain processors, such as ARM Cortex-A7, do not like such a memory access pattern very much, and it causes a major performance drop. The actual access pattern is unpredictable and is sensitive to the compiler version, optimization flags and sometimes even to changes in unrelated parts of the source code. So use the volatile keyword for the destination pointer in order to resolve this problem and make the C benchmarks more deterministic.

See https://github.com/ssvb/tinymembench/issues/7
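A simplified stand-in showing the volatile destination pointer (not the exact tinymembench routine, which is more heavily unrolled):

    #include <stdint.h>

    static void copy_with_volatile_dst(int64_t *dst_, int64_t *src, int size)
    {
        /* The volatile qualifier forces the compiler to emit the stores
         * in program order instead of shuffling them around. */
        volatile int64_t *dst = dst_;
        while ((size -= 64) >= 0)
        {
            for (int i = 0; i < 8; i++)
                dst[i] = src[i];
            dst += 8;
            src += 8;
        }
    }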
-
Siarhei Siamashka authored
The old variant was just a copy of aligned_block_copy_pf32.
-
- Mar 14, 2016
-
Siarhei Siamashka authored
While every source file already had an MIT license notice, having a LICENSE file in the top-level directory is a good idea too. This resolves #5.
-
- Sep 24, 2013
-
Siarhei Siamashka authored
-
Siarhei Siamashka authored
It is disabled by default and can only be activated by compiling the benchmark with -DBENCH_FRAMEBUFFER in CFLAGS. Basically it can be used to check how the processor handles uncached reads (assuming an integrated GPU with the framebuffer in system memory).
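The framebuffer is presumably mapped through the standard Linux fbdev interface before being used as the source buffer; a hedged sketch, not the exact code from the benchmark:

    #include <fcntl.h>
    #include <linux/fb.h>
    #include <stddef.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Map /dev/fb0 so that reads go through the (typically uncached or
     * write-combined) framebuffer memory instead of normal cached RAM. */
    static void *map_framebuffer(size_t *size_out)
    {
        struct fb_fix_screeninfo finfo;
        int fd = open("/dev/fb0", O_RDWR);
        if (fd < 0)
            return NULL;
        if (ioctl(fd, FBIOGET_FSCREENINFO, &finfo) < 0)
        {
            close(fd);
            return NULL;
        }
        void *fb = mmap(NULL, finfo.smem_len, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
        if (fb == MAP_FAILED)
        {
            close(fd);
            return NULL;
        }
        *size_out = finfo.smem_len;
        return fb;
    }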
-
- Sep 23, 2013
-
Siarhei Siamashka authored
-
- Jul 06, 2013
-
Siarhei Siamashka authored
Use a fixed prefetch distance of 512 bytes for everything. Also make sure that a 32 byte prefetch step really means a single PLD instruction executed per 32 byte data chunk, and likewise for the 64 byte prefetch. We don't care too much about achieving peak performance; consistency and predictability are more important.
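In portable C terms, the 32 byte step corresponds to roughly the following (the real routines are NEON/PLD assembly; __builtin_prefetch is used here purely as an illustration):

    #include <stdint.h>

    #define PREFETCH_DISTANCE 512            /* fixed distance in bytes */

    /* One prefetch per 32-byte chunk, issued 512 bytes ahead of the
     * current read position. */
    static void read_prefetched_32(int64_t *src, int size)
    {
        volatile int64_t sink = 0;
        for (int i = 0; i < size / 8; i += 4)    /* 4 x int64_t = 32 bytes */
        {
            __builtin_prefetch((char *)&src[i] + PREFETCH_DISTANCE, 0, 0);
            sink = src[i] + src[i + 1] + src[i + 2] + src[i + 3];
        }
        (void)sink;
    }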
-
Siarhei Siamashka authored
This benchmark exposes some problems or misconfiguration in the Allwinner A20 (Cortex-A7) memory subsystem. If we do reads from two separate buffers at once, the performance drops quite significantly. It is documented that the automatic prefetcher in Cortex-A7 can only track a single data stream: http://infocenter.arm.com/help/topic/com.arm.doc.ddi0464f/CHDIGCEB.html

However, there is still something wrong even when using explicit PLD prefetch. Here are some results:

 NEON read                                            :  1256.7 MB/s
 NEON read prefetched (32 bytes step)                 :  1346.4 MB/s (0.4%)
 NEON read prefetched (64 bytes step)                 :  1439.6 MB/s (0.4%)
 NEON read 2 data streams                             :   371.3 MB/s (0.3%)
 NEON read 2 data streams prefetched (32 bytes step)  :   687.5 MB/s (0.5%)
 NEON read 2 data streams prefetched (64 bytes step)  :   703.2 MB/s (0.4%)

Normally we would expect the memory bandwidth to remain roughly the same no matter how many data streams we are reading at once, but even reading two data streams is enough to demonstrate big trouble. Being able to simultaneously read from multiple data streams efficiently is important for 2D graphics (alpha blending), colorspace conversion (planar YUV -> packed RGB) and many other things.
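Conceptually, the 2-stream test interleaves reads from two independent buffers, as in this C sketch (the actual test is NEON assembly):

    #include <stdint.h>

    /* Read from two separate buffers at the same time. On a CPU whose
     * automatic prefetcher can track only one stream, this can be much
     * slower than reading a single buffer of the same total size. */
    static int64_t read_two_streams(int64_t *src1, int64_t *src2, int size)
    {
        int64_t acc = 0;
        for (int i = 0; i < size / 8; i += 2)
        {
            acc += src1[i] + src1[i + 1];    /* stream 1 */
            acc += src2[i] + src2[i + 1];    /* stream 2 */
        }
        return acc;
    }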
-
- Jul 02, 2013
-
Siarhei Siamashka authored
-
Siarhei Siamashka authored
-
Siarhei Siamashka authored
Now we try to run two rounds of tests: one with huge pages explicitly disabled, and another one with huge pages enabled. Additionally, the minimal block size used for the latency benchmarks is now 1024 bytes. Testing smaller blocks is just a waste of time.
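On Linux the two rounds can be requested per buffer with madvise(); a minimal sketch using the standard transparent hugepage hints (the benchmark's actual allocation code may differ):

    #include <stddef.h>
    #include <sys/mman.h>

    /* Allocate a page-aligned buffer and hint the kernel whether
     * transparent huge pages should be used for it or not. */
    static void *alloc_buffer(size_t size, int use_hugepages)
    {
        void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return NULL;
    #ifdef MADV_HUGEPAGE
        madvise(buf, size, use_hugepages ? MADV_HUGEPAGE : MADV_NOHUGEPAGE);
    #endif
        return buf;
    }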
-
- Jun 25, 2013
-
Siarhei Siamashka authored
Just select a random offset in order to mitigate the unpredictability of cache associativity effects when dealing with different physical memory fragmentation (for PIPT caches). We report the "best" measured latency, since some offsets may be better than others.
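A sketch of the idea (simplified; the offset granularity and range are assumptions here):

    #include <stddef.h>
    #include <stdlib.h>

    /* Pick a random, cache line aligned offset into a slightly oversized
     * allocation, so that repeated runs land on different cache sets and
     * the "best" latency across offsets can be reported. */
    static char *randomly_offset_buffer(char *buf, size_t extra_space)
    {
        size_t slots = extra_space / 64;     /* 64-byte granularity */
        size_t offset = slots ? (rand() % slots) * 64 : 0;
        return buf + offset;
    }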
-
- Mar 23, 2013
-
Siarhei Siamashka authored
/tmp/ccej9DYL.s:47: Rd and Rm should be different in mla (repeated)
/tmp/ccej9DYL.s:754: Rd and Rm should be different in mla (repeated)
/tmp/ccej9DYL.s:720: Error: bad immediate value for offset (5328)
/tmp/ccej9DYL.s:724: Error: bad immediate value for offset (5316)
/tmp/ccej9DYL.s:725: Error: bad immediate value for offset (5316)

https://github.com/ssvb/tinymembench/issues/1
-
- Jan 03, 2013
-
Siarhei Siamashka authored
-
- Dec 26, 2012
-
Siarhei Siamashka authored
-
Siarhei Siamashka authored
-
Siarhei Siamashka authored
-
Siarhei Siamashka authored
-
Siarhei Siamashka authored
-
Siarhei Siamashka authored
-
Siarhei Siamashka authored
-
- Dec 23, 2012
-
Siarhei Siamashka authored
-
Siarhei Siamashka authored
-