0005040: AVX implementation of DeriveTangents and NormalizeTangents - The Dark Mod Bugtracker

ID	Project	Category	View Status	Date Submitted	Last Update

0005040	The Dark Mod	Coding	public	11.08.2019 10:05	09.04.2020 04:37

Reporter	stgatilov	Assigned To	stgatilov
Priority	low	Severity	minor	Reproducibility	N/A
Status	resolved	Resolution	fixed
Product Version	TDM 2.07
Target Version	TDM 2.08	Fixed in Version	TDM 2.08

Summary	0005040: AVX implementation of DeriveTangents and NormalizeTangents
Description	We have a basic SIMD implementation with SSE2. It would be nice to have an AVX implementation for modern CPUs with 256-bit wide execution units.
Tags	No tags attached.

stgatilov 11.08.2019 10:33 administrator ~0011811	Committed in svn rev 8255. Profiling on a Ryzen 1600 shows a slight improvement over the SSE2 implementation:
DeriveTangents: 13.0% -> 12.6%
NormalizeTangents: 6.4% -> 3.6%
The Zen1 CPU has 128-bit execution units, so using AVX instead of SSE normally yields no speed benefit on it. Execution units became 256-bit wide starting with Haswell / Zen2, so I expect the new implementation to be about 70% faster on such CPUs.

stgatilov 11.08.2019 10:48 administrator ~0011812	The current baseline SSE2 implementations of DeriveTangents and NormalizeTangents work with an array-of-structures (AoS) data layout: they simply load vertices into SSE registers and try to exploit at least some parallelism this way. Of course, AoS layout is known to be ill-suited for vectorization, so this does not work well. (However, this is how vertex data is stored now, and I guess storing it in SoA would be even worse.)

The new AVX implementation performs almost all computations in structure-of-arrays (SoA) layout. It loads data for 8 items (triangles/vertices), transposes this data from AoS to SoA, then computes everything it needs with perfect vectorization. Finally, it transposes the results back from SoA to AoS and stores them.

Of course, there is a crazy amount of shuffling in the new code: transposition is done for all input and for all output. One might expect shuffles to become a bottleneck, since new Intel architectures have only one execution port for shuffles. I inspected the throughput of the loop on Skylake (using the IACA tool). The body of the loop is huge (especially in DeriveTangents), much larger than what the CPU scheduler can hold, so one cannot hope that the CPU reorders instructions perfectly. The code has several clusters of same-type instructions (e.g. shuffles), and that is part of the problem.

The other part of the problem turned out to be dependencies in the computations. Floating point math has high latency: multiplications take 5 cycles, so normalizing a vector is already 24 cycles of latency. If we immediately use the normalized vector in more floating point code, the CPU has nothing to execute in the meantime. I had to break dependencies in some places by computing the dot product with the non-unit normal vector before its normalization.

I also tried some more things:
1) Accumulate the last component (tangV_z) by spilling the register to a local float[8] array and using the elements of this array. This resulted in more uops and more cycles.
2) Use SSE/AVX to compute the addresses of the 24 vertices. This resulted in fewer uops but more cycles. I guess this is because AVX multiplications are high-latency, while multiplications on GP registers are low-latency.

Since the new NormalizeTangents is faster even on Ryzen, it might be worth replacing its current SSE2 implementation with a downgraded version of the AVX implementation.
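To illustrate the data flow described in this note, here is a minimal scalar sketch, not the committed code. It gathers 8 vertices from AoS into SoA arrays, does the math lane by lane, and scatters the results back. The `Vertex` struct and the function name `NormalizeTangents8` are hypothetical, and plain `float[8]` arrays stand in for 256-bit registers. The projection step also shows the dependency-breaking idea: the dot product is taken with the non-unit normal, so it does not have to wait for the normalization to finish.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical AoS vertex with only the fields this sketch needs;
// the engine's real vertex struct has more members.
struct Vertex {
    float normal[3];
    float tangent[3];
};

constexpr std::size_t kLanes = 8;  // 8 floats fit in one 256-bit register

// Processes exactly 8 vertices. Assumes no zero-length normals and no
// tangent parallel to its normal (degenerate cases are not handled).
void NormalizeTangents8(Vertex* verts) {
    assert(verts != nullptr);
    float n[3][kLanes];  // SoA: n[component][lane]
    float t[3][kLanes];

    // AoS -> SoA: the "transpose in" step (done with shuffles in real SIMD).
    for (std::size_t i = 0; i < kLanes; ++i) {
        for (int c = 0; c < 3; ++c) {
            n[c][i] = verts[i].normal[c];
            t[c][i] = verts[i].tangent[c];
        }
    }

    // SoA math: the same operation on 8 independent lanes.
    for (std::size_t i = 0; i < kLanes; ++i) {
        const float nn = n[0][i] * n[0][i] + n[1][i] * n[1][i] + n[2][i] * n[2][i];
        // Dot product with the NON-unit normal: independent of invLenN below.
        const float nt = n[0][i] * t[0][i] + n[1][i] * t[1][i] + n[2][i] * t[2][i];
        const float invLenN = 1.0f / std::sqrt(nn);
        // t - nhat * dot(nhat, t) == t - n * dot(n, t) / dot(n, n)
        const float k = nt / nn;
        for (int c = 0; c < 3; ++c) {
            t[c][i] -= n[c][i] * k;
        }
        const float tt = t[0][i] * t[0][i] + t[1][i] * t[1][i] + t[2][i] * t[2][i];
        const float invLenT = 1.0f / std::sqrt(tt);
        for (int c = 0; c < 3; ++c) {
            n[c][i] *= invLenN;
            t[c][i] *= invLenT;
        }
    }

    // SoA -> AoS: the "transpose out" step.
    for (std::size_t i = 0; i < kLanes; ++i) {
        for (int c = 0; c < 3; ++c) {
            verts[i].normal[c] = n[c][i];
            verts[i].tangent[c] = t[c][i];
        }
    }
}
```

In an actual AVX version each inner `float[8]` row would be one `__m256` register, and the two copy loops would be replaced by shuffle-based transposes, which is where the shuffle pressure mentioned above comes from.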

stgatilov 11.08.2019 10:50 administrator ~0011813	I have added two cvars for quick testing:
com_deriveTangentsAVX
com_normalizeTangentsAVX
When set to 0, the AVX2 implementation of the corresponding method redirects to its SSE2 equivalent, effectively disabling the AVX2 version. I'd like to delete these cvars before the 2.08 beta starts. I am NOT resolving this issue yet, so that I don't forget to delete them =)
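The redirect described in this note could look roughly like the sketch below. This is an assumption about the shape of the dispatch, not the committed code: `BoolCvar`, `SimdSSE2` and `SimdAVX` are hypothetical stand-ins for the engine's cvar and SIMD processor classes, and the methods return a label instead of doing any work so the chosen path is visible.

```cpp
#include <string>

// Hypothetical stand-in for an engine cvar (the real cvar class is not used here).
struct BoolCvar {
    const char* name;
    bool value;
};

// Same name as the testing cvar from the note; defaults to enabled.
BoolCvar com_normalizeTangentsAVX{"com_normalizeTangentsAVX", true};

// Hypothetical baseline processor.
struct SimdSSE2 {
    virtual ~SimdSSE2() = default;
    virtual std::string NormalizeTangents() { return "SSE2"; }
};

// Hypothetical wider processor that inherits the SSE2 code paths.
struct SimdAVX : SimdSSE2 {
    std::string NormalizeTangents() override {
        if (!com_normalizeTangentsAVX.value) {
            // Cvar set to 0: fall back to the inherited SSE2 version.
            return SimdSSE2::NormalizeTangents();
        }
        return "AVX";
    }
};
```

Checking the flag inside the override keeps the switch local to one method, so both code paths can be compared in the same session without changing which processor object is active.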

stgatilov 09.04.2020 04:36 administrator ~0012347	Removed cvars in svn rev 8662 on release2.08 branch.

Date Modified	Username	Field	Change
11.08.2019 10:05	stgatilov	New Issue
11.08.2019 10:05	stgatilov	Status	new => assigned
11.08.2019 10:05	stgatilov	Assigned To	=> stgatilov
11.08.2019 10:33	stgatilov	Note Added: 0011811
11.08.2019 10:48	stgatilov	Note Added: 0011812
11.08.2019 10:50	stgatilov	Note Added: 0011813
09.04.2020 04:36	stgatilov	Note Added: 0012347
09.04.2020 04:37	stgatilov	Status	assigned => resolved
09.04.2020 04:37	stgatilov	Resolution	open => fixed
09.04.2020 04:37	stgatilov	Fixed in Version	=> TDM 2.08