The compiler emits vzeroupper whenever a binary contains both SSE and AVX code, even though the SSE and AVX paths are never mixed at runtime by the code dispatcher -- which makes the vzeroupper completely unnecessary.
Checklist:
Does the AVX code run slower than SSE because of vzeroupper?
If it does: since MSVC has no option to suppress vzeroupper emission, the routines must be written in pure assembly using a dedicated assembler such as NASM.
Only hot spots are rewritten in AVX; the rest is written in C.
All SSE code was removed and only the equivalent AVX parts were implemented, to make the test fair.
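The dispatcher mentioned above selects exactly one ISA path at startup, which is why SSE and AVX instructions never interleave at runtime. A minimal sketch of that pattern (the kernel names and the `init_dispatch` helper are illustrative, not from the actual codebase; `__builtin_cpu_supports` is the GCC/Clang detection builtin, MSVC would use `__cpuidex` instead):

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical kernel entry points -- names are illustrative only. */
static void transform_sse(void)  { puts("SSE path"); }
static void transform_avx2(void) { puts("AVX2 path"); }

/* Chosen once at startup. After that, only one ISA's code ever runs,
 * so no SSE<->AVX transition can occur and vzeroupper is dead weight. */
static void (*transform)(void);

static const char *init_dispatch(void)
{
#if defined(__GNUC__)
    if (__builtin_cpu_supports("avx2")) {
        transform = transform_avx2;
        return "avx2";
    }
#endif
    transform = transform_sse;
    return "sse";
}
```

After `init_dispatch()` runs, every call goes through `transform()`, so the compiler's assumption that SSE and AVX code might mix never actually holds.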
YUY2->YUV422 Transform
Release build
Optimization on, flags: -O2 -mavx2
Scalar: 80.5fps
SSE: 197.2fps
AVX: 171.7fps
Debug build
Optimization off, flags: -O0 -mavx2
Scalar: 27.6fps
SSE: 32.2fps
AVX: 36.0fps
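For reference, the benchmarked transform is a byte deinterleave. A scalar sketch of a YUY2 -> planar YUV 4:2:2 conversion is below (this is an assumed reference implementation written for illustration, not the project's actual kernel; the SIMD versions do the same shuffle 16 or 32 bytes at a time). YUY2 packs two pixels as Y0 U Y1 V:

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar YUY2 -> planar 4:2:2 deinterleave (illustrative sketch).
 * src holds width*height*2 bytes; y gets width*height bytes,
 * u and v each get (width/2)*height bytes. width must be even. */
static void yuy2_to_yuv422p(const uint8_t *src,
                            uint8_t *y, uint8_t *u, uint8_t *v,
                            size_t width, size_t height)
{
    for (size_t row = 0; row < height; row++) {
        for (size_t x = 0; x < width; x += 2) {
            *y++ = src[0];   /* Y0 */
            *u++ = src[1];   /* U, shared by both pixels */
            *y++ = src[2];   /* Y1 */
            *v++ = src[3];   /* V, shared by both pixels */
            src += 4;
        }
    }
}
```

Note that the kernel does almost no arithmetic: it only moves bytes around, which is why the results are so sensitive to load/store behavior rather than vector width.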
Problem analysis
First of all, vzeroupper does not hurt execution speed; the effect is marginal (<0.5fps).
Without optimization, the AVX code runs faster than SSE. With optimization (Release build), however, it runs slower.
The fast loop sections are not the cause of the slowdown; it is EncodeQuantLongruns() that runs slower in the AVX build.
SSE: SSE codes
AVX: AVX codes
AVX Stream: AVX, but vmovntdq instead of vmovdqa
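The "AVX Stream" variant differs from plain AVX by a single store instruction. In intrinsics the pair looks like the sketch below (an illustrative copy loop, not the project's kernel; `_mm256_store_si256` compiles to vmovdqa, `_mm256_stream_si256` to the non-temporal vmovntdq, and the `target("avx")` attribute is a GCC/Clang convenience so the snippet builds without a global -mavx flag):

```c
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Copies n bytes (dst/src 32-byte aligned, n a multiple of 32).
 * stream=0: vmovdqa stores, write through the cache.
 * stream=1: vmovntdq stores, bypass the cache entirely. */
#if defined(__GNUC__)
__attribute__((target("avx")))
#endif
static void copy_avx(uint8_t *dst, const uint8_t *src, size_t n, int stream)
{
    for (size_t i = 0; i < n; i += 32) {
        __m256i val = _mm256_load_si256((const __m256i *)(src + i));
        if (stream)
            _mm256_stream_si256((__m256i *)(dst + i), val); /* vmovntdq */
        else
            _mm256_store_si256((__m256i *)(dst + i), val);  /* vmovdqa */
    }
    if (stream)
        _mm_sfence(); /* make the non-temporal stores globally visible */
}
```

Non-temporal stores help when the output will not be read again soon (the written cache lines would only evict useful data), but they can hurt when the next pipeline stage consumes the output immediately.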
Memory bandwidth
Some subroutines are quite simple and short, so there is no speed improvement: load and store latency dominates. The current development platforms (2990WX, 3700X, 2700X, and i7-6820HQ) have only 128 bits of memory bandwidth, and the memory bottleneck accounts for more of the processing time than the actual computation.
AMD Ryzen Threadripper 2990WX
The Ryzen TR 2990WX in particular has 128-bit memory bandwidth: it is not 256-bit because of its NUMA design.