- Feb 13, 2017
-
Siarhei Siamashka authored
-
Siarhei Siamashka authored
Unlike on Win32, it is not necessary to have underscores before function names. Calling conventions are also different.
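For reference, a common way to keep one assembly source portable across these targets is a small symbol-name macro; the sketch below is only an illustration and the macro name is hypothetical, not taken from this repository:

    /* Hypothetical helper macro: Win32 cdecl prepends an underscore to C
     * symbol names, while ELF (Linux, etc.) targets use the plain name. */
    #ifdef _WIN32
    #define ASM_SYMBOL(name) _##name
    #else
    #define ASM_SYMBOL(name) name
    #endif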
-
Siarhei Siamashka authored
This can speed up the tests on travis-ci, and we don't care about measurement accuracy there.
-
- Feb 09, 2017
-
Siarhei Siamashka authored
-
- Apr 19, 2016
-
Siarhei Siamashka authored
Add missing include for ioctl() on Linux
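On Linux the ioctl() prototype lives in <sys/ioctl.h>, so the missing include is presumably:

    #include <sys/ioctl.h>   /* declares ioctl() on Linux */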
-
- Apr 18, 2016
-
Jan Chren authored
-
- Apr 02, 2016
-
Siarhei Siamashka authored
-
- Apr 01, 2016
-
Siarhei Siamashka authored
-
- Mar 31, 2016
-
Siarhei Siamashka authored
-
- Mar 30, 2016
-
Siarhei Siamashka authored
-
Siarhei Siamashka authored
-
Siarhei Siamashka authored
-
Siarhei Siamashka authored
There is no need for the -DBENCH_FRAMEBUFFER hack anymore.
-
Siarhei Siamashka authored
Keep doubling the number of loop iterations until the duration of a test run exceeds 0.5s, and start from 1 loop iteration instead of 16. If the memory bandwidth is extremely low (for example, when running the tests on the framebuffer), this keeps the test duration reasonable. And if the memory bandwidth is extremely high, this approach reduces the overhead of the periodic gettimeofday() calls.
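A minimal C sketch of this calibration loop (an illustration of the idea, not the actual tinymembench code; the function names are made up):

    #include <stdint.h>
    #include <sys/time.h>

    static double gettime(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1000000.0;
    }

    /* Run 'func' over the buffers, doubling the iteration count until a
     * single timed run takes longer than 0.5 seconds. */
    static int calibrate_loops(void (*func)(int64_t *, int64_t *, int),
                               int64_t *dst, int64_t *src, int size)
    {
        int loops = 1;                       /* start from 1 instead of 16 */
        for (;;)
        {
            double t1 = gettime();
            for (int i = 0; i < loops; i++)
                func(dst, src, size);
            double t2 = gettime();
            if (t2 - t1 > 0.5)               /* long enough to measure */
                return loops;
            loops *= 2;                      /* keep doubling */
        }
    }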
-
- Mar 29, 2016
-
Siarhei Siamashka authored
This simplifies the code a lot and makes it possible to use different tables for different use cases.
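Presumably each test becomes an entry in a table of {description, function} pairs, and different tables can then be handed to a common runner. The names below are illustrative only, not necessarily the ones used in the source tree:

    #include <stdint.h>

    typedef void (*bench_func_t)(int64_t *dst, int64_t *src, int size);

    typedef struct
    {
        const char   *description;   /* label printed in the report */
        bench_func_t  func;          /* routine being benchmarked   */
    } bench_entry_t;

    /* Copy routines defined elsewhere (names shown only as an example). */
    extern void aligned_block_copy(int64_t *dst, int64_t *src, int size);
    extern void aligned_block_copy_backwards(int64_t *dst, int64_t *src, int size);

    /* Separate tables can be defined for different use cases, e.g. the
     * normal run vs. the framebuffer tests. */
    static const bench_entry_t c_benchmarks[] =
    {
        { "C copy",           aligned_block_copy },
        { "C copy backwards", aligned_block_copy_backwards },
        { NULL, NULL }                        /* table terminator */
    };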
-
Siarhei Siamashka authored
Because some processors are sensitive to the order of memory accesses, add a few more variants of memory buffer backwards copy, which do sequential memory writes in the forward direction inside each sub-block of a certain size. The most interesting sub-block sizes are 32 and 64 bytes, because they match the most frequently used CPU cache line sizes.

Example reports:

== ARM Cortex A7 ==
 C copy backwards                   :   266.5 MB/s
 C copy backwards (32 byte blocks)  :  1015.6 MB/s
 C copy backwards (64 byte blocks)  :  1045.7 MB/s
 C copy                             :  1033.3 MB/s

== ARM Cortex A15 ==
 C copy backwards                   :  1438.5 MB/s
 C copy backwards (32 byte blocks)  :  1497.5 MB/s
 C copy backwards (64 byte blocks)  :  2643.2 MB/s
 C copy                             :  2985.8 MB/s
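A simplified C sketch of the 64 byte block variant (an illustration of the access pattern, not the exact routine from the benchmark):

    #include <stdint.h>

    /* Walk the buffers backwards in 64-byte blocks, but copy forwards
     * within each block, so the writes inside a cache line still happen
     * in ascending address order. */
    static void block_copy_backwards_bs64(int64_t *dst, int64_t *src, int size)
    {
        int blocks = size / 64;
        dst += blocks * 8;                   /* 8 x int64_t = 64 bytes */
        src += blocks * 8;
        while (blocks-- > 0)
        {
            dst -= 8;
            src -= 8;
            for (int i = 0; i < 8; i++)      /* forward within the block */
                dst[i] = src[i];
        }
    }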
-
Siarhei Siamashka authored
This is expected to test the ability to do write combining for scattered writes and detect any possible performance penalties.

Example reports:

== ARM Cortex A7 ==
 C fill                                  :  4011.5 MB/s
 C fill (shuffle within 16 byte blocks)  :  4112.2 MB/s (0.3%)
 C fill (shuffle within 32 byte blocks)  :   333.9 MB/s
 C fill (shuffle within 64 byte blocks)  :   336.6 MB/s

== ARM Cortex A15 ==
 C fill                                  :  6065.2 MB/s (0.4%)
 C fill (shuffle within 16 byte blocks)  :  2152.0 MB/s
 C fill (shuffle within 32 byte blocks)  :  2150.7 MB/s
 C fill (shuffle within 64 byte blocks)  :  2238.2 MB/s

== ARM Cortex A53 ==
 C fill                                  :  3080.8 MB/s (0.2%)
 C fill (shuffle within 16 byte blocks)  :  3080.7 MB/s
 C fill (shuffle within 32 byte blocks)  :  3079.2 MB/s
 C fill (shuffle within 64 byte blocks)  :  3080.4 MB/s

== Intel Atom N450 ==
 C fill                                  :  1554.9 MB/s
 C fill (shuffle within 16 byte blocks)  :  1554.5 MB/s
 C fill (shuffle within 32 byte blocks)  :  1553.9 MB/s
 C fill (shuffle within 64 byte blocks)  :  1554.4 MB/s

See https://github.com/ssvb/tinymembench/issues/7
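Roughly, the shuffled fill permutes the write order inside each block, e.g. for 64 byte blocks (the permutation below is just one possible choice, not necessarily the one used):

    #include <stdint.h>

    /* Fill the buffer block by block, scattering the 8 writes inside
     * each 64-byte block instead of writing them in ascending order.
     * This probes whether write combining copes with out-of-order
     * writes within a cache line. */
    static void fill_shuffle_bs64(int64_t *dst, int64_t value, int size)
    {
        static const int order[8] = { 5, 2, 7, 0, 3, 6, 1, 4 };
        for (int block = 0; block < size / 64; block++)
        {
            for (int i = 0; i < 8; i++)
                dst[block * 8 + order[i]] = value;
        }
    }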
-
Siarhei Siamashka authored
The C compiler may attempt to reorder read and write operations when accessing the source and destination buffers. So instead of sequential memory accesses we may get something like a "drunk master style" memory access pattern. Certain processors, such as ARM Cortex-A7, do not like such a memory access pattern very much, and it causes a major performance drop. The actual access pattern is unpredictable and is sensitive to the compiler version, optimization flags and sometimes even to changes in unrelated parts of the source code. So use the volatile keyword for the destination pointer in order to resolve this problem and make the C benchmarks more deterministic.

See https://github.com/ssvb/tinymembench/issues/7
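A simplified stand-in showing the volatile destination pointer (not the exact tinymembench routine, which is more heavily unrolled):

    #include <stdint.h>

    static void copy_with_volatile_dst(int64_t *dst_, int64_t *src, int size)
    {
        /* The volatile qualifier forces the compiler to emit the stores
         * in program order instead of shuffling them around. */
        volatile int64_t *dst = dst_;
        while ((size -= 64) >= 0)
        {
            for (int i = 0; i < 8; i++)
                dst[i] = src[i];
            dst += 8;
            src += 8;
        }
    }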
-
Siarhei Siamashka authored
The old variant was just a copy of aligned_block_copy_pf32.
-
- Mar 14, 2016
-
Siarhei Siamashka authored
While every source file already had an MIT license notice, having a LICENSE file in the top-level directory is a good idea too. This resolves #5.
-
- Sep 24, 2013
-
Siarhei Siamashka authored
-
Siarhei Siamashka authored
It is disabled by default and can only be activated by compiling the benchmark with -DBENCH_FRAMEBUFFER in CFLAGS. Basically it can be used to check how the processor handles uncached reads (assuming an integrated GPU with the framebuffer in system memory).
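The framebuffer is presumably mapped through the standard Linux fbdev interface before being used as the source buffer; a hedged sketch, not the exact code from the benchmark:

    #include <fcntl.h>
    #include <linux/fb.h>
    #include <stddef.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Map /dev/fb0 so that reads go through the (typically uncached or
     * write-combined) framebuffer memory instead of normal cached RAM. */
    static void *map_framebuffer(size_t *size_out)
    {
        struct fb_fix_screeninfo finfo;
        int fd = open("/dev/fb0", O_RDWR);
        if (fd < 0)
            return NULL;
        if (ioctl(fd, FBIOGET_FSCREENINFO, &finfo) < 0)
        {
            close(fd);
            return NULL;
        }
        void *fb = mmap(NULL, finfo.smem_len, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
        if (fb == MAP_FAILED)
        {
            close(fd);
            return NULL;
        }
        *size_out = finfo.smem_len;
        return fb;
    }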
-
- Sep 23, 2013
-
Siarhei Siamashka authored
-
- Jul 06, 2013
-
Siarhei Siamashka authored
Use a fixed prefetch distance of 512 bytes for everything. Also make sure that a 32 byte prefetch step really means a single PLD instruction executed per 32 byte data chunk, and likewise for the 64 byte prefetch. We don't care too much about achieving peak performance; consistency and predictability are more important.
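In portable C terms, the 32 byte step corresponds to roughly the following (the real routines are NEON/PLD assembly; __builtin_prefetch is used here purely as an illustration):

    #include <stdint.h>

    #define PREFETCH_DISTANCE 512            /* fixed distance in bytes */

    /* One prefetch per 32-byte chunk, issued 512 bytes ahead of the
     * current read position. */
    static void read_prefetched_32(int64_t *src, int size)
    {
        volatile int64_t sink = 0;
        for (int i = 0; i < size / 8; i += 4)    /* 4 x int64_t = 32 bytes */
        {
            __builtin_prefetch((char *)&src[i] + PREFETCH_DISTANCE, 0, 0);
            sink = src[i] + src[i + 1] + src[i + 2] + src[i + 3];
        }
        (void)sink;
    }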
-
Siarhei Siamashka authored
This benchmark exposes some problems or misconfiguration in the Allwinner A20 (Cortex-A7) memory subsystem. If we do reads from two separate buffers at once, the performance drops quite significantly. It is documented that the automatic prefetcher in Cortex-A7 can only track a single data stream: http://infocenter.arm.com/help/topic/com.arm.doc.ddi0464f/CHDIGCEB.html

However, there is still something wrong even when using explicit PLD prefetch. Here are some results:

 NEON read                                            :  1256.7 MB/s
 NEON read prefetched (32 bytes step)                 :  1346.4 MB/s (0.4%)
 NEON read prefetched (64 bytes step)                 :  1439.6 MB/s (0.4%)
 NEON read 2 data streams                             :   371.3 MB/s (0.3%)
 NEON read 2 data streams prefetched (32 bytes step)  :   687.5 MB/s (0.5%)
 NEON read 2 data streams prefetched (64 bytes step)  :   703.2 MB/s (0.4%)

Normally we would expect the memory bandwidth to remain roughly the same no matter how many data streams we are reading at once, but even reading two data streams is enough to demonstrate big trouble. Being able to simultaneously read from multiple data streams efficiently is important for 2D graphics (alpha blending), colorspace conversion (planar YUV -> packed RGB) and many other things.
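Conceptually, the 2-stream test interleaves reads from two independent buffers, as in this C sketch (the actual test is NEON assembly):

    #include <stdint.h>

    /* Read from two separate buffers at the same time. On a CPU whose
     * automatic prefetcher can track only one stream, this can be much
     * slower than reading a single buffer of the same total size. */
    static int64_t read_two_streams(int64_t *src1, int64_t *src2, int size)
    {
        int64_t acc = 0;
        for (int i = 0; i < size / 8; i += 2)
        {
            acc += src1[i] + src1[i + 1];    /* stream 1 */
            acc += src2[i] + src2[i + 1];    /* stream 2 */
        }
        return acc;
    }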
-
- Jul 02, 2013
-
Siarhei Siamashka authored
-
Siarhei Siamashka authored
-
Siarhei Siamashka authored
Now we try to run two rounds of tests: one with huge pages explicitly disabled, and another one with huge pages enabled. Additionally, the minimal block size used for the latency benchmarks is now 1024 bytes. Testing smaller blocks is just a waste of time.
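On Linux the two rounds can be requested per buffer with madvise(); a minimal sketch using the standard transparent hugepage hints (the benchmark's actual allocation code may differ):

    #include <stddef.h>
    #include <sys/mman.h>

    /* Allocate a page-aligned buffer and hint the kernel whether
     * transparent huge pages should be used for it or not. */
    static void *alloc_buffer(size_t size, int use_hugepages)
    {
        void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return NULL;
    #ifdef MADV_HUGEPAGE
        madvise(buf, size, use_hugepages ? MADV_HUGEPAGE : MADV_NOHUGEPAGE);
    #endif
        return buf;
    }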
-
- Jun 25, 2013
-
Siarhei Siamashka authored
Just select a random offset in order to mitigate the unpredictability of cache associativity effects when dealing with different physical memory fragmentation (for PIPT caches). We report the "best" measured latency, since some offsets may be better than others.
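A sketch of the idea (simplified; the offset granularity and range are assumptions here):

    #include <stddef.h>
    #include <stdlib.h>

    /* Pick a random, cache line aligned offset into a slightly oversized
     * allocation, so that repeated runs land on different cache sets and
     * the "best" latency across offsets can be reported. */
    static char *randomly_offset_buffer(char *buf, size_t extra_space)
    {
        size_t slots = extra_space / 64;     /* 64-byte granularity */
        size_t offset = slots ? (rand() % slots) * 64 : 0;
        return buf + offset;
    }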
-
- Mar 23, 2013
-
Siarhei Siamashka authored
/tmp/ccej9DYL.s:47: Rd and Rm should be different in mla (repeated)
/tmp/ccej9DYL.s:754: Rd and Rm should be different in mla (repeated)
/tmp/ccej9DYL.s:720: Error: bad immediate value for offset (5328)
/tmp/ccej9DYL.s:724: Error: bad immediate value for offset (5316)
/tmp/ccej9DYL.s:725: Error: bad immediate value for offset (5316)

https://github.com/ssvb/tinymembench/issues/1
-
- Jan 03, 2013
-
Siarhei Siamashka authored
-
- Dec 26, 2012
-
Siarhei Siamashka authored
-
Siarhei Siamashka authored
-
Siarhei Siamashka authored
-
Siarhei Siamashka authored
-
Siarhei Siamashka authored
-
Siarhei Siamashka authored
-
Siarhei Siamashka authored
-
- Dec 23, 2012
-
Siarhei Siamashka authored
-
Siarhei Siamashka authored
-