View Issue Details

IDProjectCategoryView StatusLast Update
0005040The Dark ModCodingpublic09.04.2020 04:37
Reporterstgatilov Assigned Tostgatilov  
Status resolvedResolutionfixed 
Product VersionTDM 2.07 
Target VersionTDM 2.08Fixed in VersionTDM 2.08 
Summary0005040: AVX implementation of DeriveTangents and NormalizeTangents
DescriptionWe have a basic SIMD implementation with SSE2.
It might be cool to have AVX implementation for modern CPUs with 256-bit wide execution units.
TagsNo tags attached.




11.08.2019 10:33

administrator   ~0011811

Committed in svn rev 8255.

Profiling on Ryzen 1600 shows slight improvement over SSE2 implementation:
  DeriveTangents: 13.0% -> 12.6%
  NormalizeTangents: 6.4% -> 3.6%

Zen1 CPU has 128-bit execution units, so using AVX over SSE normally yields no speed benefit on it.
Execution units became 256-bit wide starting from Haswell / Zen2.
So I expect this new implementation to be about 70% faster on such CPUs.


11.08.2019 10:48

administrator   ~0011812

The current baseline SSE2 implementations or DeriveTangents and NormalizeTangents work with array-of-structure data layout: they simply load vertices into SSE registers and try to exploit at least some parallelization this way. Of course, AoS layout is known to be ill-suited for vectorization, so this does not work well. (however, this is how vertex data is now stored, and I guess storing them in SoA would be even worse)
The new AVX implementation performs almost all computations in SoA layout. It loads data for 8 items (triangles/vertices), transposes this data from AoS to SoA, then computes everything it needs with perfect vectorization. Finally, it transposes the results back from SoA to AoS, and stores them.

Of course, there is crazy amount of shuffling in the new code: transposition is done for all input and for all output.
One might expect that shuffles become a bottleneck, since new Intel architectures have only one execution port for shuffles.
I inspected throughput of the loop on Skylake (using IACA tool).

The body of the loop is huge (specially in DeriveTangents), much larger that what CPU scheduler can contain. So one cannot hope that CPU can reorder instructions perfactly. The code has several clusters of same-type instructions (i.e. shuffles), and it is part of the problem. The other part of the problem turned out to be dependencies in computations. The floating point math has high latency: multiplications take 5 cycles. So normalizing a vector is already 24 cycles of latency. If we immediately use the normalized vector for more floating point code, then CPU has nothing to execute during this moment. I had to break dependencies in some places by computing dot product with non-unit normal vector before its normalization.

I also tried some more things:
1) Accumulate the last component (tangV_z) by spilling the register to local float[8] array, and using the elements of this array. This resulted in more uops and more cycles.
2) Use SSE/AVX to compute addresses of the 24 vertices. This resulted in less uops but more cycles. I guess because AVX multipliications are high-latency, while mutliplications on GP registers are low-latency.

Since new NormalizeTangents is faster even on Ryzen, it might be worth replacing its current SSE2 implementation with a downgraded version of AVX implementation.


11.08.2019 10:50

administrator   ~0011813

I have added two cvars for quick testing:
When set to 0, the AVX2 implementation of the corresponding method redirects to its SSE2 equivalent, efficiently disabling the AVX2 version.
I'd like to delete these cvars before 2.08 beta starts.
I do NOT resolve this issue in order to not forget to delete it =)


09.04.2020 04:36

administrator   ~0012347

Removed cvars in svn rev 8662 on release2.08 branch.

Issue History

Date Modified Username Field Change
11.08.2019 10:05 stgatilov New Issue
11.08.2019 10:05 stgatilov Status new => assigned
11.08.2019 10:05 stgatilov Assigned To => stgatilov
11.08.2019 10:33 stgatilov Note Added: 0011811
11.08.2019 10:48 stgatilov Note Added: 0011812
11.08.2019 10:50 stgatilov Note Added: 0011813
09.04.2020 04:36 stgatilov Note Added: 0012347
09.04.2020 04:37 stgatilov Status assigned => resolved
09.04.2020 04:37 stgatilov Resolution open => fixed
09.04.2020 04:37 stgatilov Fixed in Version => TDM 2.08