View Issue Details
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0005040||The Dark Mod||Coding||public||11.08.2019 10:05||11.08.2019 10:50|
|Product Version||TDM 2.07|
|Target Version||TDM 2.08|
|Summary||0005040: AVX implementation of DeriveTangents and NormalizeTangents|
|Description||We have a basic SIMD implementation with SSE2.|
It might be cool to have AVX implementation for modern CPUs with 256-bit wide execution units.
|Tags||No tags attached.|
Committed in svn rev 8255.
Profiling on Ryzen 1600 shows slight improvement over SSE2 implementation:
DeriveTangents: 13.0% -> 12.6%
NormalizeTangents: 6.4% -> 3.6%
Zen1 CPU has 128-bit execution units, so using AVX over SSE normally yields no speed benefit on it.
Execution units became 256-bit wide starting from Haswell / Zen2.
So I expect this new implementation to be about 70% faster on such CPUs.
The current baseline SSE2 implementations of DeriveTangents and NormalizeTangents work with array-of-structures data layout: they simply load vertices into SSE registers and try to exploit at least some parallelism this way. Of course, the AoS layout is known to be ill-suited for vectorization, so this does not work well. (However, this is how vertex data is currently stored, and I suspect storing it in SoA would be even worse.)
The new AVX implementation performs almost all computations in SoA layout. It loads data for 8 items (triangles/vertices), transposes this data from AoS to SoA, then computes everything it needs with perfect vectorization. Finally, it transposes the results back from SoA to AoS, and stores them.
Of course, there is a crazy amount of shuffling in the new code: transposition is done for all input and all output.
One might expect the shuffles to become a bottleneck, since recent Intel architectures have only one execution port for shuffles.
I inspected the throughput of the loop on Skylake (using the IACA tool).
The body of the loop is huge (especially in DeriveTangents), much larger than what the CPU scheduler can hold, so one cannot hope that the CPU reorders instructions perfectly. The code has several clusters of same-type instructions (e.g. shuffles), and that is part of the problem. The other part turned out to be dependency chains in the computations. Floating-point math has high latency: a multiplication takes 5 cycles, so normalizing a vector already costs 24 cycles of latency. If we immediately use the normalized vector in more floating-point code, the CPU has nothing to execute in the meantime. I had to break dependencies in some places by computing a dot product with the non-unit normal vector before its normalization.
I also tried some more things:
1) Accumulate the last component (tangV_z) by spilling the register to a local float array and using the elements of that array. This resulted in more uops and more cycles.
2) Use SSE/AVX to compute the addresses of the 24 vertices. This resulted in fewer uops but more cycles, I guess because AVX multiplications have high latency, while multiplications on GP registers are low-latency.
Since the new NormalizeTangents is faster even on Ryzen, it might be worth replacing its current SSE2 implementation with a downgraded version of the AVX implementation.
I have added two cvars for quick testing:
When set to 0, the AVX2 implementation of the corresponding method redirects to its SSE2 equivalent, effectively disabling the AVX2 version.
I'd like to delete these cvars before 2.08 beta starts.
I am deliberately NOT resolving this issue, so that I don't forget to delete them =)
|11.08.2019 10:05||stgatilov||New Issue|
|11.08.2019 10:05||stgatilov||Status||new => assigned|
|11.08.2019 10:05||stgatilov||Assigned To||=> stgatilov|
|11.08.2019 10:33||stgatilov||Note Added: 0011811|
|11.08.2019 10:48||stgatilov||Note Added: 0011812|
|11.08.2019 10:50||stgatilov||Note Added: 0011813|