The compiler emits vzeroupper whenever a binary contains both SSE and AVX code, even though the SSE and AVX paths are never mixed at runtime by the code dispatcher -- which makes the vzeroupper completely unnecessary.
Checklist:
Does the AVX code run slower than SSE because of vzeroupper?
If it does: since MSVC has no option to suppress vzeroupper emission, the routines must be written in pure assembly using a dedicated assembler such as NASM.
Only hot spots are rewritten in AVX; the rest is written in C.
All SSE code was removed and only the equivalent AVX parts were implemented, to make the test fair.
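The dispatcher mentioned above selects exactly one ISA path at startup, which is why SSE and AVX instructions never interleave at runtime. A minimal sketch of that pattern (the kernel names and the `init_dispatch` helper are illustrative, not from the actual codebase; `__builtin_cpu_supports` is the GCC/Clang detection builtin, MSVC would use `__cpuidex` instead):

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical kernel entry points -- names are illustrative only. */
static void transform_sse(void)  { puts("SSE path"); }
static void transform_avx2(void) { puts("AVX2 path"); }

/* Chosen once at startup. After that, only one ISA's code ever runs,
 * so no SSE<->AVX transition can occur and vzeroupper is dead weight. */
static void (*transform)(void);

static const char *init_dispatch(void)
{
#if defined(__GNUC__)
    if (__builtin_cpu_supports("avx2")) {
        transform = transform_avx2;
        return "avx2";
    }
#endif
    transform = transform_sse;
    return "sse";
}
```

After `init_dispatch()` runs, every call goes through `transform()`, so the compiler's assumption that SSE and AVX code might mix never actually holds.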
YUY2->YUV422 Transform
Release build
Optimization on, flags: -O2 -mavx2
Scalar: 80.5fps
SSE: 197.2fps
AVX: 171.7fps
Debug build
Optimization off, flags: -O0 -mavx2
Scalar: 27.6fps
SSE: 32.2fps
AVX: 36.0fps
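For reference, the benchmarked transform is a byte deinterleave. A scalar sketch of a YUY2 -> planar YUV 4:2:2 conversion is below (this is an assumed reference implementation written for illustration, not the project's actual kernel; the SIMD versions do the same shuffle 16 or 32 bytes at a time). YUY2 packs two pixels as Y0 U Y1 V:

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar YUY2 -> planar 4:2:2 deinterleave (illustrative sketch).
 * src holds width*height*2 bytes; y gets width*height bytes,
 * u and v each get (width/2)*height bytes. width must be even. */
static void yuy2_to_yuv422p(const uint8_t *src,
                            uint8_t *y, uint8_t *u, uint8_t *v,
                            size_t width, size_t height)
{
    for (size_t row = 0; row < height; row++) {
        for (size_t x = 0; x < width; x += 2) {
            *y++ = src[0];   /* Y0 */
            *u++ = src[1];   /* U, shared by both pixels */
            *y++ = src[2];   /* Y1 */
            *v++ = src[3];   /* V, shared by both pixels */
            src += 4;
        }
    }
}
```

Note that the kernel does almost no arithmetic: it only moves bytes around, which is why the results are so sensitive to load/store behavior rather than vector width.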
Problem analysis
First of all, vzeroupper does not hurt execution speed; the effect is marginal (<0.5fps).
Without optimization, the AVX code runs faster than SSE. With optimization (Release build), however, it runs slower.
The fast loop sections are not the cause of the slowdown; it is EncodeQuantLongruns() that runs slower in the AVX build.
SSE: SSE codes
AVX: AVX codes
AVX Stream: AVX, but vmovntdq instead of vmovdqa
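The "AVX Stream" variant differs from plain AVX by a single store instruction. In intrinsics the pair looks like the sketch below (an illustrative copy loop, not the project's kernel; `_mm256_store_si256` compiles to vmovdqa, `_mm256_stream_si256` to the non-temporal vmovntdq, and the `target("avx")` attribute is a GCC/Clang convenience so the snippet builds without a global -mavx flag):

```c
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Copies n bytes (dst/src 32-byte aligned, n a multiple of 32).
 * stream=0: vmovdqa stores, write through the cache.
 * stream=1: vmovntdq stores, bypass the cache entirely. */
#if defined(__GNUC__)
__attribute__((target("avx")))
#endif
static void copy_avx(uint8_t *dst, const uint8_t *src, size_t n, int stream)
{
    for (size_t i = 0; i < n; i += 32) {
        __m256i val = _mm256_load_si256((const __m256i *)(src + i));
        if (stream)
            _mm256_stream_si256((__m256i *)(dst + i), val); /* vmovntdq */
        else
            _mm256_store_si256((__m256i *)(dst + i), val);  /* vmovdqa */
    }
    if (stream)
        _mm_sfence(); /* make the non-temporal stores globally visible */
}
```

Non-temporal stores help when the output will not be read again soon (the written cache lines would only evict useful data), but they can hurt when the next pipeline stage consumes the output immediately.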
Memory bandwidth
Some subroutines are quite simple and short, so there is no speed improvement: load and store latency dominates. The current development platforms (2990WX, 3700X, 2700X, and i7-6820HQ) have only 128 bits of memory bandwidth, and the memory bottleneck accounts for more of the processing time than the actual computation.
AMD Ryzen Threadripper 2990WX
The Ryzen TR 2990WX in particular has 128-bit memory bandwidth: it is not 256-bit because of its NUMA design.